## Weather in F1: Data Mining

The dataframe bellow needs to be filled with ones corresponding to the weather conditions in each race. Considering that the adherence of the circuit is one of the most important factors for a driver's performance, the presence/absence of clouds or sun will be ignored, and only the degree of wetness of the road will be recorded.

In [229]:
races_df = pd.read_csv(r'C:\Users\joaoe\Documentos\JP\JP_GitHub\f1_predict_ML\data\races.csv')
races_df = races_df.drop(columns=['raceId','round','time','circuitId','url'])
races_df = races_df.sort_values(by=['date'])
races_df = races_df.reset_index(drop=True)
length = len(races_df)
races_df.head()

weather = {'Descriptive': ["" for x in range(length)],'Dry': np.zeros(length), 'Wet': np.zeros(length),'Drizzle': np.zeros(length),'Rain': np.zeros(length)}
weather_df = pd.DataFrame(data=weather)

weather_df = pd.concat([races_df.reset_index(drop=True),weather_df.reset_index(drop=True)], axis=1)

In [262]:
weather_df.iloc[897:936].to_csv(r'C:\Users\joaoe\Documentos\JP\JP_GitHub\f1_predict_ML\data\weather_2014_2015_v2.csv', index = False)

### **1st Attempt**: Data scraping [F1 Facts](https://f1-facts.com/), a website with historical statistics from F1 races, from 1950 to 2018
The first step was using the Beautiful Soup library to read the html and capture the labels for each weather condition.

In [239]:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import pandas as pd
import numpy as np

# Reading through html file
myurl = 'https://f1-facts.com/stats/weather'
uClient = uReq(myurl)
page_html = uClient.read()
page_soup = soup(page_html,"lxml")

# Finding table containing the desired information
table = page_soup.find(text="1950").find_parent("table")

Now, we must capture the weather information, shown by an icon which defines if there is rain, sun, clouds, etc. We can read the source from each icon, and mapped that link to a specific weather state.

In [240]:
# Reading all image sources: ordered chronologically
images = table.findChildren("img")

# Creating a list with the source links converted to string values
image_list=[[] for i in range(len(images))]
for i in range(len(images)):
    image_list[i] = str(images[i])

Going from 1950 to 2018, cronologically, the digits corresponding to the weather condition are read, and converted into a descriptive text. 

In [241]:
# Up to 2018, 2015 missing 

pd.set_option('mode.chained_assignment', None)
for i in range(len(images)):
    if image_list[i][35:38] == 'w1.':
        weather_df['Descriptive'][i] = 'Sun, Dry'
        weather_df['Dry'][i] = 1
        
    elif image_list[i][35:38] == 'w2.':
        weather_df['Descriptive'][i]  = 'Dry, Clouds'
        weather_df['Dry'][i] = 1
        
    elif image_list[i][35:38] == 'w3.':
        weather_df['Descriptive'][i]  = 'Dry, Cold'
        weather_df['Dry'][i] = 1
        
    elif image_list[i][35:38] == 'w4.':
        weather_df['Descriptive'][i]  = 'Sun, short drizzle'
        weather_df['Drizzle'][i] = 1
        
    elif image_list[i][35:38] == 'w5.':
        weather_df['Descriptive'][i]  = 'Rain'
        weather_df['Rain'][i] = 1
        
    elif image_list[i][35:38] == 'w6.':
        weather_df['Descriptive'][i]  = 'Wet'
        weather_df['Wet'][i] = 1
        
    elif image_list[i][35:38] == 'w7.':
        weather_df['Descriptive'][i]  = 'Rain, then dry'
        weather_df['Rain'][i] = 1
        weather_df['Dry'][i] = 1
        
    elif image_list[i][35:38] == 'w8.':
        weather_df['Descriptive'][i]  = 'Dry, Cloudy'
        weather_df['Dry'][i] = 1

    elif image_list[i][35:38] == 'w9.':
        weather_df['Descriptive'][i]  = 'Rain'  
        weather_df['Rain'][i] = 1
        
    elif image_list[i][35:38] == 'w10':
        weather_df['Descriptive'][i]  = 'Dry, hot' 
        weather_df['Dry'][i] = 1
        
    elif image_list[i][35:38] == 'w.':
        weather_df['Descriptive'][i]  = 'Unknown'     

### **2nd Attempt**: Adding values from a .csv corresponding to 2014-2015
Because there were gaps on the weather for those 2 years, a file was used, were the information from different sources was compiled.

In [263]:
weather_df_2014 = pd.read_csv(r'C:\Users\joaoe\Documentos\JP\JP_GitHub\f1_predict_ML\data\weather_2014_2015_v2.csv')
weather_df_2014['Dry'] = np.zeros(len(weather_df_2014))
weather_df_2014['Wet'] = np.zeros(len(weather_df_2014))
weather_df_2014['Drizzle'] = np.zeros(len(weather_df_2014))
weather_df_2014['Rain'] = np.zeros(len(weather_df_2014))

In [264]:
weather_df_2014.head(1)

Unnamed: 0,year,name,date,Descriptive,Dry,Wet,Drizzle,Rain
0,2014,Australian Grand Prix,3/16/2014,Dry,0.0,0.0,0.0,0.0


In [254]:
for i in range(len(weather_df_2014)):
    if weather_df_2014['Weather'][i] == 'Dry':
        weather_df_2014['Dry'][i] = 1
        
    elif weather_df_2014['Weather'][i] == 'Wet':
        weather_df_2014['Wet'][i] = 1
        
    elif weather_df_2014['Weather'][i] == 'Drizzle':
        weather_df_2014['Drizzle'][i] = 1
        
    elif weather_df_2014['Weather'][i] == 'Rain':
        weather_df_2014['Rain'][i] = 1
        

In [267]:
weather_df_2014

Unnamed: 0,year,name,date,Descriptive,Dry,Wet,Drizzle,Rain
0,2014,Australian Grand Prix,3/16/2014,Dry,0.0,0.0,0.0,0.0
1,2014,Malaysian Grand Prix,3/30/2014,Dry,0.0,0.0,0.0,0.0
2,2014,Bahrain Grand Prix,4/6/2014,Dry,0.0,0.0,0.0,0.0
3,2014,Chinese Grand Prix,4/20/2014,Dry,0.0,0.0,0.0,0.0
4,2014,Spanish Grand Prix,5/11/2014,Dry,0.0,0.0,0.0,0.0
5,2014,Monaco Grand Prix,5/25/2014,Wet,0.0,0.0,0.0,0.0
6,2014,Canadian Grand Prix,6/8/2014,Dry,0.0,0.0,0.0,0.0
7,2014,Austrian Grand Prix,6/22/2014,Dry,0.0,0.0,0.0,0.0
8,2014,British Grand Prix,7/6/2014,Dry,0.0,0.0,0.0,0.0
9,2014,German Grand Prix,7/20/2014,Dry,0.0,0.0,0.0,0.0
