# Scope of work

1) Import all necessary libraries and modules.  
2) Obtain information about all available cars from the first main pages.  
3) Create a SQL database and export the information gathered from the main pages to it.  
4) Call the module dealers_cars to acquire links to every dealer's list of cars from all main pages.  
5) Repeat the procedure from step 2 for every dealer's list of cars.  
6) Add the extracted data to the created SQL database.

## 1. Import of libraries and modules

In [1]:
import main_pages
import parsing
import marks
import dataframe
import sql_db
import dealers_cars
import cars_scraper
from datetime import datetime

## 2. Acquiring cars' information from the first main pages

Here we call the module **parsing** to scrape data from the main pages of the website. Then we store all gathered information about cars in the corresponding lists.

In [2]:
# Link to the main webpage
url = 'https://www.autoscout24.com/lst?atype=C&desc=0&sort=standard&source=homepage_search-mask&ustate=N%2CU'

The module **main_pages** collects the URLs of all main pages of the website autoscout24.com.

In [3]:
all_pages = main_pages.pages_urls(url)
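
How `main_pages.pages_urls` works internally is not shown in this notebook. One plausible approach is sketched below, assuming the site exposes its result pages through a `page` query parameter. That parameter, and the names `build_page_urls`, `n_pages` and `page_param`, are assumptions made for illustration, not the module's actual code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_page_urls(base_url, n_pages=20, page_param="page"):
    """Return one URL per result page, differing only in the page parameter.

    Hypothetical helper: the real main_pages module may discover pages differently.
    """
    parts = urlsplit(base_url)
    # Keep every existing query parameter except a page number already present.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != page_param
    ]
    urls = []
    for number in range(1, n_pages + 1):
        new_query = urlencode(query + [(page_param, str(number))])
        urls.append(urlunsplit(parts._replace(query=new_query)))
    return urls


example_urls = build_page_urls(
    "https://www.autoscout24.com/lst?atype=C&ustate=N%2CU", n_pages=3
)
for example_url in example_urls:
    print(example_url)
```

Parsing and re-encoding the query string keeps the existing filters (such as `ustate`) intact on every generated URL.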

With the help of the **parsing** module we scrape information about cars from the main pages (20).

In [4]:
start = datetime.now()
cars, characteristics, prices, locations = parsing.cars_info(all_pages)
end = datetime.now()
print('Total time :', end-start)

Total time : 0:00:22.003014


Here we call the module **marks** to extract all existing car marks from the website. Afterwards we replace
the spaces in the marks' names with dashes ('-').

In [5]:
marks_menu = marks.all_marks(url)
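
The space-to-dash replacement described above can be sketched as follows. The helper names `normalize_mark` and `split_title` and the sample list `raw_marks` are hypothetical. The longest-prefix matching in `split_title` is only an assumption about how a listing title might be split into mark and model, not the behaviour of the **marks** or **dataframe** modules.

```python
def normalize_mark(name):
    """Replace runs of whitespace in a mark name with a single dash."""
    return "-".join(name.split())


def split_title(title, known_marks):
    """Split a listing title into (normalized mark, model).

    Longer marks are tried first so that a multi-word mark such as
    'Land Rover' is matched as a whole.
    """
    for mark in sorted(known_marks, key=len, reverse=True):
        if title == mark or title.startswith(mark + " "):
            return normalize_mark(mark), title[len(mark):].strip()
    return None, title  # no known mark found


raw_marks = ["Land Rover", "Mercedes-Benz", "DS Automobiles", "BMW"]
normalized = [normalize_mark(mark) for mark in raw_marks]
print(normalized)
print(split_title("Land Rover Range Rover Evoque", raw_marks))
```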

Here we call the module **dataframe** to gather all information about the cars into one dataframe.

In [6]:
df = dataframe.df_construct(marks_menu, cars, characteristics, prices, locations)

In [7]:
df

| | mark | model | mileage | transmission | registration | fuel | power | location | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Volkswagen | T6 Transporter | 205800 | Manual | 11/2017 | Diesel | 150 | DE | 22900 |
| 1 | Land-Rover | Range Rover Evoque | 174000 | Automatic | 04/2012 | Diesel | 150 | IT | 15500 |
| 2 | Mercedes-Benz | S 580 | 3084 | Automatic | 05/2022 | Gasoline | 503 | DE | 195890 |
| 3 | Polestar | 2 | 8836 | Automatic | 12/2021 | Electric | 231 | BE | 45890 |
| 4 | BMW | M850 | 47318 | Automatic | 12/2018 | Gasoline | 530 | DE | 64499 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 394 | BMW | 320 | 109125 | Manual | 04/1981 | Gasoline | 122 | DE | 15890 |
| 395 | DS-Automobiles | DS 4 | 66000 | Manual | 06/2017 | Gasoline | 131 | BE | 11750 |
| 396 | Volkswagen | Polo | 143000 | Manual | 12/2014 | Gasoline | 90 | BE | 7490 |
| 397 | Ford | Mondeo | 120502 | Automatic | 02/2016 | Diesel | 179 | BE | 14599 |


## 3. Creating a SQL database and exporting the parsed data from the main pages to it

Here we use another module, **sql_db**, which connects to the PostgreSQL database *autoscout*. This database has a schema *autoscout* that contains the main table *cars*.

In [8]:
sql_db.connect(df,'replace')
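
The second argument of `sql_db.connect` is either `'replace'` here or `'append'` later in the notebook. These are the same names pandas' `DataFrame.to_sql` accepts for `if_exists`, although whether **sql_db** uses that method is not shown. The difference between the two modes can be illustrated with a small stand-in that uses only the standard library and an in-memory SQLite database instead of PostgreSQL. The function `write_rows` and the reduced three-column table are hypothetical.

```python
import sqlite3


def write_rows(conn, rows, if_exists):
    """Write (mark, model, price) rows to the table `cars`.

    'replace' drops the table first, 'append' keeps existing rows.
    Illustrative stand-in only, not the sql_db module itself.
    """
    if if_exists not in ("replace", "append"):
        raise ValueError("if_exists must be 'replace' or 'append'")
    if if_exists == "replace":
        conn.execute("DROP TABLE IF EXISTS cars")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cars (mark TEXT, model TEXT, price INTEGER)"
    )
    conn.executemany("INSERT INTO cars VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
write_rows(conn, [("BMW", "320", 15890)], "replace")    # fresh table, 1 row
write_rows(conn, [("Ford", "Mondeo", 14599)], "append")  # now 2 rows
rows_after_append = conn.execute("SELECT COUNT(*) FROM cars").fetchone()[0]
write_rows(conn, [("Volkswagen", "Polo", 7490)], "replace")  # back to 1 row
rows_after_replace = conn.execute("SELECT COUNT(*) FROM cars").fetchone()[0]
print(rows_after_append, rows_after_replace)
```

Starting with `'replace'` and continuing with `'append'` gives every run a clean table that is then filled incrementally.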

## 4. Acquiring links to every dealer's list of cars from all main pages

The function **dealers_cars.sel_pars(page)** returns a list of the href links to dealers' car lists found on one main page. We want to collect these links from all main pages (20). To do so we create a dictionary: each key is a link to one of the main pages and each value is the list of href links to the car dealers on that page.

In [None]:
# The loop below takes around 40 minutes to build the entire dictionary

In [None]:
start = datetime.now()

dealers_cars_dict = {}
for page in all_pages:
    dealers_cars_dict[page] = dealers_cars.sel_pars(page)

end = datetime.now()
print('Total time :', end-start)

The generated dictionary can contain duplicate URLs (references to the same car dealer). We want to keep only unique links, so we convert each list in **dealers_cars_dict** to a set and then back to a list for convenience.

In [None]:
for key, value in dealers_cars_dict.items():
    dealers_cars_dict[key] = list(set(value))
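
Converting each list to a set removes duplicates within one page only. The same dealer can still appear under several main pages, and the set also discards the original order. A possible refinement is sketched below with a hypothetical helper `unique_dealer_links` and made-up example URLs. It keeps the first occurrence of every link across the whole dictionary and preserves order.

```python
def unique_dealer_links(links_by_page):
    """Return a copy of the dictionary with each link kept only once overall.

    A link is kept under the first page where it appears. Order is preserved.
    """
    seen = set()
    result = {}
    for page, links in links_by_page.items():
        kept = []
        for link in links:
            if link not in seen:
                seen.add(link)
                kept.append(link)
        result[page] = kept
    return result


# Made-up URLs for illustration only.
example_dict = {
    "https://example.com/lst?page=1": ["/dealer/a", "/dealer/b", "/dealer/a"],
    "https://example.com/lst?page=2": ["/dealer/b", "/dealer/c"],
}
deduplicated = unique_dealer_links(example_dict)
print(deduplicated)
```

With this variant each dealer is scraped once, even when the dealer is listed on more than one main page.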

## 5. Acquiring cars' information from all the car dealers and storing it in a SQL database

In [None]:
#import ast

In [None]:
#with open('dict.txt', 'r') as file:
#    file_content = file.read()
#dealers_cars_dict = ast.literal_eval(file_content)
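
The commented-out cells above reload a previously saved dictionary from `dict.txt`, which avoids repeating the 40-minute link collection. How that file was written is not shown. One option, sketched here under the assumption that the dictionary holds only strings and lists of strings, is a JSON round trip. The helper names `save_links` and `load_links` and the file name are hypothetical.

```python
import json
import os
import tempfile


def save_links(links_by_page, path):
    """Write the page -> dealer links dictionary to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(links_by_page, handle, indent=2)


def load_links(path):
    """Read the dictionary back from a JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Made-up content standing in for dealers_cars_dict.
example_links = {
    "https://example.com/lst?page=1": ["/dealer/a", "/dealer/b"],
    "https://example.com/lst?page=2": ["/dealer/c"],
}
json_path = os.path.join(tempfile.mkdtemp(), "dealers_cars_dict.json")
save_links(example_links, json_path)
restored_links = load_links(json_path)
print(restored_links == example_links)
```

JSON files can be inspected by hand or read by other tools, and loading them does not involve evaluating Python literals.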

In [None]:
#dealers_cars_dict

In [None]:
# The code below runs for about 40 minutes

In [None]:
# This loop iterates over each href link in the dictionary dealers_cars_dict. It scrapes information about every car
# from each car dealer and builds a dataframe from it. Afterwards it uploads each dataframe to the SQL database. The entire
# dataset is saved into a new table in the database with the current date in its name.
start = datetime.now()

for key, value_list in dealers_cars_dict.items():
    for value in value_list:
        all_pgs = main_pages.pages_urls(value)
        cars0, characteristics0, prices0, locations0 = parsing.cars_info(all_pgs)
        df0 = dataframe.df_construct(marks_menu, cars0, characteristics0, prices0, locations0)
        sql_db.connect(df0, 'append')  # append the freshly built dataframe to the SQL database

end = datetime.now()
print('Total time :', end-start)

Total time : 0:37:47.408648
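
A loop that runs for this long can be stopped by a single failing dealer page. The sketch below shows one way to keep going and record failures for a later retry. It is not part of the original notebook: `scrape_all` is a hypothetical wrapper, and `fake_scrape` stands in for the real per-dealer work (`main_pages.pages_urls`, `parsing.cars_info`, `dataframe.df_construct`, `sql_db.connect`).

```python
def scrape_all(links_by_page, scrape_one):
    """Call scrape_one(link) for every dealer link and collect failures.

    Returns (number of successful links, list of (link, error text)).
    """
    done = 0
    failed = []
    for page, links in links_by_page.items():
        for link in links:
            try:
                scrape_one(link)
                done += 1
            except Exception as exc:  # keep the long run alive
                failed.append((link, repr(exc)))
    return done, failed


def fake_scrape(link):
    """Stand-in for the real scraping step. Fails for one made-up link."""
    if link == "/dealer/broken":
        raise RuntimeError("page could not be parsed")


example_pages = {
    "https://example.com/lst?page=1": ["/dealer/a", "/dealer/broken"],
    "https://example.com/lst?page=2": ["/dealer/c"],
}
done_count, failed_links = scrape_all(example_pages, fake_scrape)
print(done_count, failed_links)
```

The returned list of failed links can be passed through the same function again once the cause of the errors is understood.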

In [None]:
start = datetime.now()

for page in all_pages:
    cars_scraper.parser(page, marks_menu)

end = datetime.now()
print('Total time :', end-start)

Total time : 1:28:46.690304