<h1 style = "text-align: center; ">What influences a VW Golf price?</h1>
<h2 style = "text-align: center; ">ST445 - Managing and Visualizing Data</h2>
<h3 style = "text-align: center; ">Candidate IDs: 38682, 50450, 44051</h3>


##### Overview
##### I. Notebook preparation
##### II. Introduction: Research Question + Data Choice
##### III. Data acquisition
        a. UK used cars data (Kaggle API) + EDA (Plot Car Makes over years and price average) (Plot VW Golf)
        b. Economic factors (Webscraping)
        c. Environmental factors (manual PDF extraction)
##### IV. Data Visualization
        a. Price vs. Car Data (Heat Map, Correlation Matrix)
        b. Price vs. Economic Factors (??)
        c. Price vs. Environmental Factors (??)
        d. choose most influential factors (max 3)
##### V. Data Modeling
        a. Model Linear Regression on 3 best variables
##### VI. Conclusion

![Volkswagen Golf](https://m.atcdn.co.uk/vms/media/w980/2fa3b55ab44d4744969f968b5727c8d2.jpg)

### I. Notebook preparation (maybe this section is not needed)

Perhaps we include something similar to this example from "Example 2"

[[Before running this notebook, please make sure you have all necessary modules installed in your environment. Potentially less common modules used include:

google.cloud
dotenv
networkx
geopandas
praw
transformers
plotly.graph_objects
ipywidgets
folium
As usual, they can be installed by running the command pip install [module] in the terminal.

Furthermore, please make sure your Python version is compatible with all the modules. While writing this, it became apparent there might be some compatibility issues with newer Python versions (especially 3.11 and newer). In case you run into any issues, it might be worth trying to run the code with an older version such as Python 3.9.]]

Our complete GitHub repository can be found at the following location: https://github.com/lse-st445/2024-project-data-knows-ball [[Should we put this in the title of our paper??]]

In [166]:
# Import relevant packages
import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
import numpy as np
import os
import zipfile
import sqlite3
from matplotlib import pyplot as plt
import seaborn as sns

# Install lxml with conda install anaconda::lxml to use HMTL and XML with Python
# conda install openpyxl

After importing all necessary libararies, we set the standard settings for the notebook regarding the plot sizes of visualiasing our data.

In [158]:
# Set the global default size of matplotlib figures
plt.rc('figure', figsize=(10, 5))

# Size of matplotlib figures that contain subplots
fizsize_with_subplots = (10, 5)

# Size of matplotlib histogram bins
bin_size = 10

### II. Introduction and data description

[[Describe our data sets and pose our research question]]

[[Maybe include data dictionaries of some sort similar to Table 1.3.1 and Table 1.3.2 in "Example 2"]]

### III. Data acquisition

#### III.a. UK 100k Cars Data

##### III.a.i. Kaggle API

In the first part of our data acquisition, we are focusing on gathering as much used car data as possible for understanding the UK used cars landscape. After that, we can further focus on the VW Golf data as an example car which is one of the most used cars in the UK. 

To acquire UK used cars data, the choice was to work with Kaggle since several datasets for UK used cars are available on the platform which are ready to be worked with. In order to work with the data, we are using the Kaggle API to access the data via a download. This initiated to download the relevant .csv files for our work with the UK used cars landscape. 

In [81]:
# Use Kaggle API to access the relevant dataset
import kaggle

kaggle.api.authenticate()
kaggle.api.dataset_download_files("adityadesai13/used-car-dataset-ford-and-mercedes", path=".", unzip=True)

Dataset URL: https://www.kaggle.com/datasets/adityadesai13/used-car-dataset-ford-and-mercedes


In the following the data gathered from the Kaggle dataset was uploaded into the github repository from where we are loading it into sepearte dataframes. Ultimately all cars data, which are seperated into the makes by .csv files, are merged alltogether into one lareg dataframe allowing us to perform EDA on the UK used cars landscape. 

In [142]:
# Load retrieved data into Dataframes and modify
uk_cars_audi = pd.read_csv("audi.csv")
uk_cars_audi["Make"] = "Audi"
uk_cars_bmw = pd.read_csv("bmw.csv") 
uk_cars_bmw["Make"] = "BMW"
uk_cars_ford = pd.read_csv("ford.csv")
uk_cars_ford["Make"] = "Ford"
uk_cars_hyundai = pd.read_csv("hyundi.csv")
uk_cars_hyundai["Make"] = "Hyundai"
uk_cars_mercedes = pd.read_csv("skoda.csv")
uk_cars_mercedes["Make"] = "Mercedes"
uk_cars_skoda = pd.read_csv("skoda.csv")
uk_cars_skoda["Make"] = "Skoda"
uk_cars_toyota = pd.read_csv("toyota.csv")
uk_cars_toyota["Make"] = "Toyota"
uk_cars_vauxhall = pd.read_csv("vauxhall.csv")
uk_cars_vauxhall["Make"] = "Vauxhall"
uk_cars_vw = pd.read_csv("vw.csv")
uk_cars_vw["Make"] = "VW"

# Merge to one DataFrame
uk_cars_make = [uk_cars_audi, uk_cars_bmw, uk_cars_ford, 
                uk_cars_hyundai, uk_cars_mercedes, uk_cars_skoda, 
                uk_cars_toyota, uk_cars_vauxhall, uk_cars_vw]
uk_cars_data = pd.concat(uk_cars_make)

In [85]:
display(uk_cars_data)

Unnamed: 0,model,year,price,transmission,mileage,fuelType,tax,mpg,engineSize,Make,tax(£)
0,A1,2017,12500,Manual,15735,Petrol,150.0,55.4,1.4,Audi,
1,A6,2016,16500,Automatic,36203,Diesel,20.0,64.2,2.0,Audi,
2,A1,2016,11000,Manual,29946,Petrol,30.0,55.4,1.4,Audi,
3,A4,2017,16800,Automatic,25952,Diesel,145.0,67.3,2.0,Audi,
4,A3,2019,17300,Manual,1998,Petrol,145.0,49.6,1.0,Audi,
...,...,...,...,...,...,...,...,...,...,...,...
15152,Eos,2012,5990,Manual,74000,Diesel,125.0,58.9,2.0,VW,
15153,Fox,2008,1799,Manual,88102,Petrol,145.0,46.3,1.2,VW,
15154,Fox,2009,1590,Manual,70000,Petrol,200.0,42.0,1.4,VW,
15155,Fox,2006,1250,Manual,82704,Petrol,150.0,46.3,1.2,VW,


We do have to make some modifications and clean up to the make main uk_cars_data dataframe ready to be worked with. 
1. We need to reset the indexes, since we have 92.335 cars and not ca. 15k (this is due to merging separate dataframes and their indexes)
2. Some cars' road taxes are stated seperately in the column "tax (£)", altough all .csv files are formatted with the "tax" column. Thus we need to merge both, since the original "tax" column is stated in GBP £
3. We want to add a column to know the respective Make in the overview dataframe uk_cars_data (and moving it to the first column of the dataframe)
4. We need a standardized price index for price/mileage to be able to compare the cars with different mileages

In [144]:
# 1. Modifying the indexes of the dataframe so we can work with it
number_of_cars = len(uk_cars_data)
new_indexes = list(range(0, number_of_cars))
uk_cars_data = uk_cars_data.reset_index(drop=True)
uk_cars_data = uk_cars_data.reindex(index=new_indexes)

# 2. Adjusting the column Tax(£)
if 'tax' not in uk_cars_data.columns:
    uk_cars_data['tax'] = None
uk_cars_data['tax'] = uk_cars_data['tax'].combine_first(uk_cars_data['tax(£)'])
uk_cars_data = uk_cars_data.drop(columns='tax(£)')

# 3. Make of the cars to the beginning of the dataframe
columns_order = ["Make"] + [col for col in uk_cars_data.columns if col != "Make"]
uk_cars_data = uk_cars_data[columns_order]

# 4. Add another column with the comparison index price/mileage
uk_cars_data["price/mileage"] = uk_cars_data["price"]/uk_cars_data["mileage"]

In [146]:
display(uk_cars_data)

Unnamed: 0,Make,model,year,price,transmission,mileage,fuelType,tax,mpg,engineSize,price/mileage
0,Audi,A1,2017,12500,Manual,15735,Petrol,150.0,55.4,1.4,0.794407
1,Audi,A6,2016,16500,Automatic,36203,Diesel,20.0,64.2,2.0,0.455763
2,Audi,A1,2016,11000,Manual,29946,Petrol,30.0,55.4,1.4,0.367328
3,Audi,A4,2017,16800,Automatic,25952,Diesel,145.0,67.3,2.0,0.647349
4,Audi,A3,2019,17300,Manual,1998,Petrol,145.0,49.6,1.0,8.658659
...,...,...,...,...,...,...,...,...,...,...,...
92330,VW,Eos,2012,5990,Manual,74000,Diesel,125.0,58.9,2.0,0.080946
92331,VW,Fox,2008,1799,Manual,88102,Petrol,145.0,46.3,1.2,0.020420
92332,VW,Fox,2009,1590,Manual,70000,Petrol,200.0,42.0,1.4,0.022714
92333,VW,Fox,2006,1250,Manual,82704,Petrol,150.0,46.3,1.2,0.015114


##### III.a.ii. EDA of UK used cars data
Here we are performing explanatory data analysis to understand the UK used cars landscape better by looking into the dataframe more closely.

In [129]:
# Generate Overview of the main dataframe
uk_cars_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 92335 entries, 0 to 92334
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Make          92335 non-null  object 
 1   model         92335 non-null  object 
 2   year          92335 non-null  int64  
 3   price         92335 non-null  int64  
 4   transmission  92335 non-null  object 
 5   mileage       92335 non-null  int64  
 6   fuelType      92335 non-null  object 
 7   tax           92335 non-null  float64
 8   mpg           92335 non-null  float64
 9   engineSize    92335 non-null  float64
dtypes: float64(3), int64(3), object(4)
memory usage: 7.7+ MB


In [131]:
# Understand how many values we have per category
unique_counts = uk_cars_data.nunique()
print(unique_counts)

Make                9
model             168
year               27
price           11271
transmission        4
mileage         39387
fuelType            5
tax                47
mpg               187
engineSize         32
dtype: int64


In [133]:
# Focus: Unique values for fuelType and transmission
print("Fuel Types:", uk_cars_data['fuelType'].unique())
print("Transmissions:", uk_cars_data['transmission'].unique())

Fuel Types: ['Petrol' 'Diesel' 'Hybrid' 'Other' 'Electric']
Transmissions: ['Manual' 'Automatic' 'Semi-Auto' 'Other']


In [None]:
#plt.plot(uk_cars_data["price/mileage"])
sns.displot(uk_cars_data["price/mileage"])

#### III.b. UK Office of National Statistics (ONS)

##### III.b.i Webscrapping: Unemployment rate and CPIH (time series economic data)

In [21]:
# Write function for webscrapping data from the UK Office of National Statistics
def webscrape_ONS(url):
    '''
    This function webscrapes various tables from the UK ONS and seperates the data 
    into distinct dataframes based on the given periodicity: year, quarter, or month.
    ----------
    Args:
        url: The UK Office of National Statistics url from which to webscrabe the table
    ----------
    Returns:        
        ons_year_df: Dataframe of UK ONS data at the yearly level
        ons_quarter_df: Dataframe of UK ONS data at the quarterly level
        ons_month_df: Dataframe of UK ONS data at the monthly level
    '''
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "lxml")

    # Save the table headers to later set as column names for the dataframes
    table_headers = soup.find_all("th")
    table_headers = table_headers[0:2] # We only need the first two columns of data from the ONS
    table_headers = [t.text for t in table_headers]

    ons_data = []

    # Identify and append all webscrapped rows of the ONS table into a dataframe
    for i, row in enumerate(soup.find_all("tr")[2:]): # The frist two rows of ONS tables are headers
        try:
            period, value = row.find_all("td")[0:2] # We only need the first two columns of data from the ONS
            ons_data.append([period.text, value.text])
        except:
            print("Error parsing row #{}".format(i))

    ons_df = pd.DataFrame(ons_data, columns = table_headers)

    # Make the "Value" column data type float instead of string as it was webscrapped
    ons_df = ons_df.astype({"Value": float})

    # Split the data into separate dataframes based on periodicity (year/quarter/month)
    ons_year_df = ons_df[ons_df["Period"].str.len() == 4].reset_index(drop = True) # Year periods will have 4 characters (e.g., "2020")
    ons_quarter_df = ons_df[ons_df["Period"].str.len() == 7].reset_index(drop = True) # Quarter periods will have 7 characters (e.g., "2020 Q1")
    ons_month_df = ons_df[ons_df["Period"].str.len() == 8].reset_index(drop = True) # Month periods will have 8 characters (e.g., "2020 JAN")

    # For dataframes at the yearly level, make year an int type instead of string as it was webscrapped
    ons_year_df = ons_year_df.astype({"Period": int})
    
    # Ensure that all rows present in the original ONS table are present in the three dataframes split based on periodicity
    split_df_len = sum([len(ons_year_df), len(ons_quarter_df), len(ons_month_df)])
    orig_df_len = len(ons_data)
    assert split_df_len == orig_df_len, "ERROR: Not all rows from original ONS table present in corresponding year/quarter/month dataframes"

    return ons_year_df, ons_quarter_df, ons_month_df


In [23]:
# Webscrape UK unemployment and CPIH data tables from the ONS
url_uk_unemp = "https://www.ons.gov.uk/employmentandlabourmarket/peoplenotinwork/unemployment/timeseries/mgsx/lms"
url_uk_cpih = "https://www.ons.gov.uk/economy/inflationandpriceindices/timeseries/l55o/mm23"

uk_unemp_year_df, uk_unemp_quarter_df, uk_unemp_month_df = webscrape_ONS(url_uk_unemp)
uk_cpih_year_df, uk_cpih_quarter_df, uk_cpih_month_df = webscrape_ONS(url_uk_cpih)


##### III.b.ii Webscraping: Gross Disposable Household Income (geographic economic data)

In [25]:
# Webscrape data from the UK Office of National Statistics -- Gross Disposable Household Income (GDHI)
url_uk_gdhi = "https://www.ons.gov.uk/economy/regionalaccounts/grossdisposablehouseholdincome/bulletins/regionalgrossdisposablehouseholdincomegdhi/1997to2022"

page = requests.get(url_uk_gdhi)
soup = BeautifulSoup(page.content, "lxml")

# Save the table headers to later set as column names for the dataframes
table_headers = soup.find_all("th")
table_headers = [t.text for t in table_headers]
uk_countries_regions_df = pd.DataFrame(table_headers[8:22]) # The data of the 1st column ("Counties and regions of the UK") is defined as <th> as opposed to <td> and will be combined with rest of data later
table_headers = table_headers[0:5] # We only need the first five columns of data from the ONS


In [27]:
gdhi_data = []

# Identify and append all webscrapped rows of the ONS table into a dataframe
# The ONS table of interest is: Table 1: Gross disposable household income by UK and constituent countries and regions, UK, 2022
# NOTE: 2022 is the most recent year available for ONS data on this topic, this statistical bulletin was released on September 4, 2024
for i, row in enumerate(soup.find_all("tr")[1:15]): # The first row of the ONS tables is headers; Table 1 contains 14 rows
    try:
        pop, gdhi, gdhi_growth, gdhi_index = row.find_all("td")[0:4] # We only need the first four columns of data from the ONS as the 1st column ("Counties and regions of the UK") is defined as <th> as opposed to <td> and will be combined later
        gdhi_data.append([pop.text, gdhi.text, gdhi_growth.text, gdhi_index.text])
    except:
        print("Error parsing row #{}".format(i))

partial_df = pd.DataFrame(gdhi_data)


In [29]:
# Combine the UK countries and regions with the rest of the GDHI data
uk_gdhi_df = pd.concat([uk_countries_regions_df, partial_df], axis = 1)
uk_gdhi_df.columns = table_headers

# Clean the UK GDHI data
uk_gdhi_df.rename({"Countriesandregions of the UK": "Countries and regions of the UK",
                   "Population(million)": "Population (million)",
                   "GDHI perhead (£)": "GDHI per head (£)"},
                  axis = "columns",
                  inplace = True)
uk_gdhi_df["GDHI per head (£)"] = uk_gdhi_df["GDHI per head (£)"].str.replace(",", "")
uk_gdhi_df = uk_gdhi_df.astype({"Population (million)": float, "GDHI per head (£)": int, "Growth in GDHI per head (%)": float, "GDHI per head index (UK=100)": float})
uk_gdhi_df.replace("NorthernIreland", "Northern Ireland", inplace = True)


##### III.b.iii Importing XLSX: Median gross weekly earnings (geographic economic data)

In [17]:
# Import XLSX from the UK Office of National Statistics -- Figure 6: Median gross weekly earnings for full-time employees for all local authorities by place of work 
# NOTE: This XLSX is provided via download from the following ONS statistical bulletin, "Employee earnings in the UK: 2024"
    # https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/earningsandworkinghours/bulletins/annualsurveyofhoursandearnings/2024
ons_weekly_earnings_df = pd.read_excel("ONS_figure6.xlsx", skiprows = 8)


In [19]:
# Clean the UK weekly earnings data
earnings = ons_weekly_earnings_df["Earnings"]
earnings_numeric = pd.to_numeric(earnings, errors = "coerce")
ons_without_earnings = ons_weekly_earnings_df.drop(columns = "Earnings")

uk_weekly_earnings_df = pd.concat([ons_without_earnings, earnings_numeric], axis = 1)
uk_weekly_earnings_df.dropna(inplace = True, ignore_index = True)
uk_weekly_earnings_df["Local authority name"] = uk_weekly_earnings_df["Local authority name"].str.strip()


#### III. (to be deleted?) US Bureau of Labor Statistics (BLS) API

In [15]:
# Call the BLS API for the following two datasets between the years of 2019-2024
    # LNS14000000: The USA U-3 unemployment rate ("official rate"); civilian unemployment rate for 16 years and older, seasonally adjusted
        # https://data.bls.gov/timeseries/LNS14000000
    # CUUR0000SA0: CPI - All items in U.S. city average, all urban consumers, not seasonally adjusted
        # https://data.bls.gov/timeseries/CUUR0000SA0
# We are interested in retaining the last 5 years (2020-2024) of data. The reason for including 2019, is so that we can compute the YoY CPI percentage change
# NOTE: because we have not registered we can only querry this API 25 times per day
headers = {'Content-type': 'application/json'}
data = json.dumps({"seriesid": ['LNS14000000', 'CUUR0000SA0'], "startyear": "2019", "endyear": "2024"})
p = requests.post('https://api.bls.gov/publicAPI/v1/timeseries/data/', data = data, headers = headers)
json_data = json.loads(p.text)


In [11]:
# Save the BLS unemployment and CPI data within a dataframe
bls_data = []

# NOTE: this code is heavily derived from the BLS API Version 1.0 Python Sample Code
    # https://www.bls.gov/developers/api_python.htm
for series in json_data['Results']['series']:
    seriesId = series['seriesID']
    for item in series['data']:
        year = item['year']
        period = item['period']
        value = item['value']

        if 'M01' <= period <= 'M12':
            bls_data.append([seriesId,year,period,value])

bls_df = pd.DataFrame(bls_data, columns = ["SeriesID", "Year", "Period", "Value"])


In [12]:
# Separate BLS table into unemployment and CPI dataframes
unemp_df = bls_df[bls_df["SeriesID"] == "LNS14000000"].reset_index(drop = True)
cpi_df = bls_df[bls_df["SeriesID"] == "CUUR0000SA0"].reset_index(drop = True)

# Ensure that all rows present in the original BLS table are present in the two split unemployment and CPI dataframes
split_df_len = sum([len(unemp_df), len(cpi_df)])
orig_df_len = len(bls_df)
assert split_df_len == orig_df_len, "ERROR: Not all rows from original BLS table present in corresponding unemployment and CPI dataframes"

# Clean the BLS unemployment data
unemp_df.drop("SeriesID", axis = 1, inplace = True)
unemp_df = unemp_df.astype({"Year": int, "Value": float})

# Clean the BLS CPI data
cpi_df.drop("SeriesID", axis = 1, inplace = True)
cpi_df = cpi_df.astype({"Year": int, "Value": float})


In [13]:
# We are only interested in the previous 5 years of US BLS unemployment data (this is at the monthly level)
us_unemp_month_df = unemp_df[unemp_df["Year"] > 2019].reset_index(drop = True)


In [14]:
# Restructure BLS CPI data to compute year-over-year values
cpi_df_wide = cpi_df.pivot(index = "Year", columns = "Period", values = "Value").pct_change(fill_method = None).reset_index()
cpi_df_long = pd.melt(cpi_df_wide, id_vars = ["Year"], value_vars = ["M01", "M02", "M03", "M04", "M05", "M06", "M07", "M08", "M09", "M10", "M11", "M12"])

# We are only interested in the previous 5 years of US BLS CPI data (this is at the monthly level)
us_cpi_month_df = cpi_df_long.dropna().reset_index(drop = True)
us_cpi_month_df["Value"] = us_cpi_month_df["value"] * 100
us_cpi_month_df.drop("value", axis = 1, inplace = True)
us_cpi_month_df = us_cpi_month_df.sort_values(["Year", "Period"], ascending = False).reset_index(drop = True)


#### Current datasets that we have:

UK national unemployment data at the year (1971-2023), quarter (1971Q1-2023Q3), and month (1971FEB-2024SEP) level. <br>
Webscrapped from UK Office of National Statistics: https://www.ons.gov.uk/employmentandlabourmarket/peoplenotinwork/unemployment/timeseries/mgsx/lms <br>
**uk_unemp_year_df**, **uk_unemp_quarter_df**, **uk_unemp_month_df**

UK national CPIH data at the year (1989-2024), quarter (1989Q1-2024Q4), and month (1989JAN-2024DEC) level. <br>
Webscrapped from UK Office of National Statistics: https://www.ons.gov.uk/economy/inflationandpriceindices/timeseries/l55o/mm23 <br>
**uk_cpih_year_df**, **uk_cpih_quarter_df**, **uk_cpih_month_df**

UK and constituent countries and regions 2022 Gross Disposable Household Income (GDHI) by country/region. <br>
Webscrapped data from the UK Office of National Statistics: https://www.ons.gov.uk/economy/regionalaccounts/grossdisposablehouseholdincome/bulletins/regionalgrossdisposablehouseholdincomegdhi/1997to2022 <br>
**uk_gdhi_df**

Great Britain, April 2024, median gross weekly earnings for full-time employees for all local authorities by place of work. <br>
Import XLSX associated with "Figure 6: Median gross weekly earnings for full-time employees for all local authorities by place of work" provided for download by the UK Office of National Statistics: https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/earningsandworkinghours/bulletins/annualsurveyofhoursandearnings/2024 <br>
**uk_weekly_earnings_df**

USA monthly (January 2020 - December 2024) U-3 unemployment rate ("official rate"); civilian unemployment rate for 16 years and older, seasonally adjusted. <br>
Called from the US Bureau of Labor Statistics (BLS) API for the following timeseries (LNS14000000):
https://data.bls.gov/timeseries/LNS14000000 <br>
**us_unemp_month_df**

USA monthly (January 2020 - December 2024) year-over-year percentage change for CPI - All items in U.S. city average, all urban consumers, not seasonally adjusted. <br>
Called from the US Bureau of Labor Statistics (BLS) API for the following timeseries (CUUR0000SA0):
https://data.bls.gov/timeseries/CUUR0000SA0 <br>
**us_cpi_month_df**


[[Description of what visualizations we decided to include and why]]

### III. c. Environmental factors: Volkswagen Golf CO2 Emissions Data

In [None]:
# Scraping from https://www.cars-data.com/en/volkswagen-golf/co2-emissions

In [78]:
# Code for visualizations

vw = pd.read_csv("vw.csv")
vw['model'] = vw['model'].str.strip()
golf = vw[vw['model'] == 'Golf']

display(golf)

Unnamed: 0,model,year,price,transmission,mileage,fuelType,tax,mpg,engineSize
733,Golf,2019,17000,Manual,8000,Diesel,145,57.7,1.6
734,Golf,2019,36000,Automatic,4000,Petrol,145,32.8,2.0
735,Golf,2015,19390,Automatic,20031,Petrol,200,40.4,2.0
736,Golf,2019,16290,Automatic,14821,Petrol,145,44.8,1.0
737,Golf,2017,16491,Automatic,20693,Petrol,20,60.1,1.4
...,...,...,...,...,...,...,...,...,...
5591,Golf,2015,11750,Manual,79000,Diesel,20,67.3,2.0
5592,Golf,2016,11950,Automatic,41725,Petrol,30,53.3,1.4
5593,Golf,2017,12950,Automatic,44837,Diesel,20,67.3,2.0
5594,Golf,2014,11299,Manual,25495,Petrol,30,53.3,1.4


In [103]:

vw['fuelType'] = vw['fuelType'].str.strip()
vw['transmission'] = vw['transmission'].str.strip()



def make_setup(row):
    return str(row['year']) + " " + row['transmission'] + " " + row['fuelType'] + " " + str(row['engineSize'])
#golf['setup'] = str(golf['year']) + row['fuelType'].str.strip() + str(row['engineSize'].str.strip())

# Apply the custom function row-wise to concatenate
golf['setup'] = golf.apply(make_setup, axis=1)


In [110]:
# sort df by year
golf = golf.sort_values(by='year')

# just want 2015-2020
golf = golf[golf['year'] > 2014]

# dropping 'Other' and 'Hybrid' fuel types
golf['fuelType'] = golf['fuelType'].str.strip()
golf = golf[golf['fuelType'] != 'Other']
golf = golf[golf['fuelType'] != 'Hybrid']


golf = golf[golf['engineSize'] != 0.0]

# dropping semi-auto transmission
golf = golf[golf['transmission'] != 'Semi-Auto']


display(golf)

Unnamed: 0,model,year,price,transmission,mileage,fuelType,tax,mpg,engineSize,setup
1037,Golf,2015,18498,Manual,25102,Petrol,200,39.8,2.0,2015 Manual Petrol 2.0
4439,Golf,2015,11490,Manual,58000,Diesel,20,67.3,2.0,2015 Manual Diesel 2.0
2361,Golf,2015,11133,Manual,31722,Petrol,20,58.9,1.4,2015 Manual Petrol 1.4
5403,Golf,2015,10680,Manual,38535,Petrol,30,53.3,1.4,2015 Manual Petrol 1.4
1602,Golf,2015,11495,Manual,20012,Petrol,30,53.3,1.4,2015 Manual Petrol 1.4
...,...,...,...,...,...,...,...,...,...,...
4801,Golf,2020,20396,Manual,1555,Diesel,145,57.7,1.6,2020 Manual Diesel 1.6
4567,Golf,2020,27989,Automatic,8256,Petrol,145,37.7,2.0,2020 Automatic Petrol 2.0
4985,Golf,2020,28640,Automatic,100,Petrol,145,47.1,1.5,2020 Automatic Petrol 1.5
5290,Golf,2020,25000,Manual,1000,Petrol,150,49.6,1.5,2020 Manual Petrol 1.5


In [115]:
setups = golf['setup'].unique()
len(setups)

setups

array(['2015 Manual Petrol 2.0', '2015 Manual Diesel 2.0',
       '2015 Manual Petrol 1.4', '2015 Manual Diesel 1.6',
       '2015 Automatic Diesel 1.6', '2015 Automatic Diesel 2.0',
       '2015 Automatic Petrol 2.0', '2015 Automatic Petrol 1.4',
       '2015 Manual Petrol 1.2', '2016 Automatic Diesel 1.6',
       '2016 Manual Petrol 1.4', '2016 Manual Diesel 2.0',
       '2016 Manual Diesel 1.6', '2016 Automatic Petrol 2.0',
       '2016 Manual Petrol 2.0', '2016 Automatic Diesel 2.0',
       '2016 Automatic Petrol 1.4', '2016 Manual Petrol 1.0',
       '2016 Automatic Petrol 1.0', '2016 Manual Petrol 1.2',
       '2017 Manual Petrol 1.0', '2017 Manual Diesel 1.6',
       '2017 Manual Petrol 1.4', '2017 Manual Petrol 1.5',
       '2017 Manual Petrol 2.0', '2017 Automatic Petrol 1.4',
       '2017 Manual Diesel 2.0', '2017 Automatic Diesel 2.0',
       '2017 Manual Petrol 1.2', '2017 Automatic Petrol 2.0',
       '2017 Automatic Petrol 1.5', '2017 Automatic Diesel 1.6',
       '2017 A

In [116]:

emissions_dict = {
    '2015 Manual Petrol 2.0' : 139,
    '2015 Manual Diesel 2.0' : 106,
    '2015 Manual Petrol 1.4' : 120,
    '2015 Manual Diesel 1.6' : 99,
    '2015 Automatic Diesel 1.6' : 102,
    '2015 Automatic Diesel 2.0' : 117,
    '2015 Automatic Petrol 2.0' : 145,
    '2015 Automatic Petrol 1.4' : 116,
    '2015 Manual Petrol 1.2' : 113, 
    '2016 Automatic Diesel 1.6' : 102,
    '2016 Manual Petrol 1.4' : 120, 
    '2016 Manual Diesel 2.0' : 109,
    '2016 Manual Diesel 1.6' : 103,
    '2016 Automatic Petrol 2.0' : 145,
    '2016 Manual Petrol 2.0' : 148,
    '2016 Automatic Diesel 2.0' : 117,
    '2016 Automatic Petrol 1.4' : 120,
    '2016 Manual Petrol 1.0' : 105,
    '2016 Automatic Petrol 1.0' : 103, 
    '2016 Manual Petrol 1.2' : 113,
    '2017 Manual Petrol 1.0' : 105,
    '2017 Manual Diesel 1.6' : 105,
    '2017 Manual Petrol 1.4' : 120, 
    '2017 Manual Petrol 2.0' : 162, 
    '2017 Automatic Petrol 1.4' : 116,
    '2017 Manual Diesel 2.0' : 114, 
    '2017 Automatic Diesel 2.0' : 127,
    '2017 Manual Petrol 1.2' : 113, 
    '2017 Automatic Petrol 2.0' : 160,
    '2017 Automatic Diesel 1.6' : 104,
    '2017 Automatic Petrol 1.0' : 103, 
    '2017 Automatic Diesel 1.4' : 102,
    '2018 Manual Petrol 1.4' : 120,
    '2018 Manual Petrol 2.0' : 148, 
    '2018 Automatic Petrol 2.0' : 148,
    '2018 Manual Petrol 1.0' : 108, 
    '2018 Manual Diesel 1.6' : 106,
    '2018 Manual Diesel 2.0' : 109, 
    '2018 Automatic Petrol 1.4' : 119,
    '2018 Manual Petrol 1.5' : 110, 
    '2018 Automatic Diesel 2.0' : 114,
    '2018 Automatic Petrol 1.5' : 110, 
    '2018 Automatic Diesel 1.6' : 102,
    '2018 Automatic Petrol 1.0' : 108, 
    '2019 Manual Diesel 1.6' : 109,
    '2019 Manual Petrol 1.5' : 130, 
    '2019 Automatic Petrol 1.5' : 130,
    '2019 Manual Petrol 1.0' : 109,
    '2019 Automatic Diesel 2.0' : 116,
    '2019 Manual Diesel 2.0' : 115,
    '2019 Automatic Diesel 1.6' : 104,
    '2019 Automatic Petrol 1.0' : 110, 
    '2020 Manual Diesel 1.6' : 134,
    '2020 Automatic Petrol 1.5' : 145, 
    '2020 Manual Petrol 1.5' : 135,
    '2020 Automatic Diesel 2.0' : 142, 
    '2020 Manual Diesel 2.0' : 137, 
    '2020 Manual Petrol 1.0' : 115
    }

golf['CO2 Emissions (g/km)'] = golf['setup'].map(emissions_dict).fillna("NA")

In [123]:
display(golf)

Unnamed: 0,model,year,price,transmission,mileage,fuelType,tax,mpg,engineSize,setup,CO2 Emissions (g/km)
1037,Golf,2015,18498,Manual,25102,Petrol,200,39.8,2.0,2015 Manual Petrol 2.0,139.0
4439,Golf,2015,11490,Manual,58000,Diesel,20,67.3,2.0,2015 Manual Diesel 2.0,106.0
2361,Golf,2015,11133,Manual,31722,Petrol,20,58.9,1.4,2015 Manual Petrol 1.4,120.0
5403,Golf,2015,10680,Manual,38535,Petrol,30,53.3,1.4,2015 Manual Petrol 1.4,120.0
1602,Golf,2015,11495,Manual,20012,Petrol,30,53.3,1.4,2015 Manual Petrol 1.4,120.0
...,...,...,...,...,...,...,...,...,...,...,...
4801,Golf,2020,20396,Manual,1555,Diesel,145,57.7,1.6,2020 Manual Diesel 1.6,134.0
4567,Golf,2020,27989,Automatic,8256,Petrol,145,37.7,2.0,2020 Automatic Petrol 2.0,
4985,Golf,2020,28640,Automatic,100,Petrol,145,47.1,1.5,2020 Automatic Petrol 1.5,145.0
5290,Golf,2020,25000,Manual,1000,Petrol,150,49.6,1.5,2020 Manual Petrol 1.5,135.0


[[Explanation/interpretation of the visualizations are depicting]]

### V. Data modeling

### VI. Conclusion