<a href="https://colab.research.google.com/github/haneeth25/Rain-in-Australia/blob/main/Rain_in_Australia.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Context
Predict next-day rain by training classification models on the target variable RainTomorrow.

# Content
This dataset contains about 10 years of daily weather observations from many locations across Australia.


# Source & Acknowledgements
Observations were drawn from numerous weather stations. The daily observations are available from http://www.bom.gov.au/climate/data.
An example of latest weather observations in Canberra: http://www.bom.gov.au/climate/dwo/IDCJDW2801.latest.shtml

Definitions adapted from http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml
Data source: http://www.bom.gov.au/climate/dwo/ and http://www.bom.gov.au/climate/data.

Copyright Commonwealth of Australia 2010, Bureau of Meteorology.

> First, we import NumPy, pandas, and Matplotlib's pyplot module.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

> Next, we load the CSV file into our Google Colab notebook.

In [2]:
weather_df = pd.read_csv("weatherAUS.csv")
weather_df.head(10)

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,WNW,20.0,24.0,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,WSW,4.0,22.0,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,WSW,19.0,26.0,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,E,11.0,9.0,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,NW,7.0,20.0,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No
5,2008-12-06,Albury,14.6,29.7,0.2,,,WNW,56.0,W,W,19.0,24.0,55.0,23.0,1009.2,1005.4,,,20.6,28.9,No,No
6,2008-12-07,Albury,14.3,25.0,0.0,,,W,50.0,SW,W,20.0,24.0,49.0,19.0,1009.6,1008.2,1.0,,18.1,24.6,No,No
7,2008-12-08,Albury,7.7,26.7,0.0,,,W,35.0,SSE,W,6.0,17.0,48.0,19.0,1013.4,1010.1,,,16.3,25.5,No,No
8,2008-12-09,Albury,9.7,31.9,0.0,,,NNW,80.0,SE,NW,7.0,28.0,42.0,9.0,1008.9,1003.6,,,18.3,30.2,No,Yes
9,2008-12-10,Albury,13.1,30.1,1.4,,,W,28.0,S,SSE,15.0,11.0,58.0,27.0,1007.0,1005.7,,,20.1,28.2,Yes,No


> Now we need to clean our data, i.e. handle missing values and remove unwanted rows. To do that, we first need to explore the data.

# Data Exploration

* MinTemp = The lowest temperature recorded on that day, in degrees Celsius
* MaxTemp = The highest temperature recorded on that day, in degrees Celsius
* Rainfall = The amount of rain recorded for that day, in mm. The related RainToday column is "Yes" if the rain for that day was more than 1 mm.
* Evaporation = Class A pan evaporation (mm) in the 24 hours to 9am, i.e. how much liquid water entered the atmosphere as water vapour
* Sunshine = The number of hours of bright sunshine in the day
* WindGustDir = The direction of the strongest wind gust in the 24 hours to midnight
* WindGustSpeed = The speed (km/h) of the strongest wind gust in the 24 hours to midnight

> We assume here that rainfall does not depend on Date and Location, so both columns are dropped when the features are built.

> Both wind direction and wind speed matter here: precipitation amounts are higher for events with easterly and southerly winds than for westerly and northerly winds, and the rainfall loss rate increases when the wind speed increases and/or the rainfall intensity decreases.
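
> Assumptions like the two above can be checked against the data instead of being taken on trust. The sketch below is a minimal illustration on a small invented frame (the rows are made up; only the column names match the dataset): it turns the Yes/No target into 1/0 and compares the rain rate per location and per gust direction.

```python
import pandas as pd

# Toy data, invented for illustration; column names match the dataset.
toy = pd.DataFrame({
    "Location": ["Albury", "Albury", "Cairns", "Cairns", "Cairns", "Albury"],
    "WindGustDir": ["W", "E", "E", "S", "E", "W"],
    "RainTomorrow": ["No", "No", "Yes", "Yes", "No", "No"],
})

# Convert the Yes/No target to 1/0 so that a group mean is a rain rate.
toy["rain"] = (toy["RainTomorrow"] == "Yes").astype(int)

rate_by_location = toy.groupby("Location")["rain"].mean()
rate_by_gust_dir = toy.groupby("WindGustDir")["rain"].mean()
print(rate_by_location)
print(rate_by_gust_dir)
```

> On the real data, the same two `groupby` calls on `weather_df` would show whether the rates differ noticeably between groups.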

In [3]:
# Check for missing data; in a notebook only the last expression (the row count) is displayed
weather_df.isna().sum()
len(weather_df)

145460
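
> Because a notebook cell only displays its last expression, the per-column missing counts are not shown above. A minimal sketch of how both the counts and the share of missing values could be printed, using a small invented frame (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration; column names match the dataset.
toy = pd.DataFrame({
    "MinTemp": [13.4, np.nan, 12.9, 9.2],
    "Sunshine": [np.nan, np.nan, np.nan, 7.6],
    "RainToday": ["No", "No", None, "Yes"],
})

# Number of missing values per column.
missing_counts = toy.isna().sum()
# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

print(missing_counts)
print(missing_share)
```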

In [4]:
# Fill null values: numeric columns with their mean, wind-direction columns with "missing".
# RainToday and RainTomorrow are left untouched and handled in the next cell.
numeric_cols = ["MinTemp", "MaxTemp", "Rainfall", "Evaporation", "Sunshine",
                "WindGustSpeed", "WindSpeed9am", "WindSpeed3pm",
                "Humidity9am", "Humidity3pm", "Pressure9am", "Pressure3pm",
                "Cloud9am", "Cloud3pm", "Temp9am", "Temp3pm"]
direction_cols = ["WindGustDir", "WindDir9am", "WindDir3pm"]

# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about.
weather_df[numeric_cols] = weather_df[numeric_cols].fillna(weather_df[numeric_cols].mean())
weather_df[direction_cols] = weather_df[direction_cols].fillna("missing")
weather_df.isna().sum()

Date                0
Location            0
MinTemp             0
MaxTemp             0
Rainfall            0
Evaporation         0
Sunshine            0
WindGustDir         0
WindGustSpeed       0
WindDir9am          0
WindDir3pm          0
WindSpeed9am        0
WindSpeed3pm        0
Humidity9am         0
Humidity3pm         0
Pressure9am         0
Pressure3pm         0
Cloud9am            0
Cloud3pm            0
Temp9am             0
Temp3pm             0
RainToday        3261
RainTomorrow     3267
dtype: int64
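
> The cell above computes each mean over the whole dataset. A possible variation, shown here only as a sketch and not what this notebook does, is to learn the fill value from the training rows and reuse it for the test rows, so the test set does not influence preprocessing. The frame and the split point below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration.
toy = pd.DataFrame({"Humidity3pm": [20.0, np.nan, 40.0, 60.0, np.nan, 90.0]})

# Hypothetical split: first four rows for training, last two for testing.
train, test = toy.iloc[:4].copy(), toy.iloc[4:].copy()

# Learn the fill value from the training rows only.
train_mean = train["Humidity3pm"].mean()

# Apply the same value to both parts.
train["Humidity3pm"] = train["Humidity3pm"].fillna(train_mean)
test["Humidity3pm"] = test["Humidity3pm"].fillna(train_mean)
print(train_mean)
```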

In [5]:
# RainToday and RainTomorrow have a large impact on model training, so instead of filling them we drop the rows where either column is missing
weather_df = weather_df.dropna(subset=['RainToday', 'RainTomorrow'])
weather_df.isna().sum()

Date             0
Location         0
MinTemp          0
MaxTemp          0
Rainfall         0
Evaporation      0
Sunshine         0
WindGustDir      0
WindGustSpeed    0
WindDir9am       0
WindDir3pm       0
WindSpeed9am     0
WindSpeed3pm     0
Humidity9am      0
Humidity3pm      0
Pressure9am      0
Pressure3pm      0
Cloud9am         0
Cloud3pm         0
Temp9am          0
Temp3pm          0
RainToday        0
RainTomorrow     0
dtype: int64

# Splitting data into x and y labels

In [6]:
# RainTomorrow is our target column
# Split the data into features (x) and target (y), dropping Date and Location from the features
x = weather_df.drop(["Date","Location","RainTomorrow"],axis = 1)
y = weather_df["RainTomorrow"]

In [7]:
x

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday
0,13.4,22.9,0.6,5.468232,7.611178,W,44.0,W,WNW,20.0,24.0,71.0,22.0,1007.7,1007.1,8.000000,4.50993,16.9,21.8,No
1,7.4,25.1,0.0,5.468232,7.611178,WNW,44.0,NNW,WSW,4.0,22.0,44.0,25.0,1010.6,1007.8,4.447461,4.50993,17.2,24.3,No
2,12.9,25.7,0.0,5.468232,7.611178,WSW,46.0,W,WSW,19.0,26.0,38.0,30.0,1007.6,1008.7,4.447461,2.00000,21.0,23.2,No
3,9.2,28.0,0.0,5.468232,7.611178,NE,24.0,SE,E,11.0,9.0,45.0,16.0,1017.6,1012.8,4.447461,4.50993,18.1,26.5,No
4,17.5,32.3,1.0,5.468232,7.611178,W,41.0,ENE,NW,7.0,20.0,82.0,33.0,1010.8,1006.0,7.000000,8.00000,17.8,29.7,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145454,3.5,21.8,0.0,5.468232,7.611178,E,31.0,ESE,E,15.0,13.0,59.0,27.0,1024.7,1021.2,4.447461,4.50993,9.4,20.9,No
145455,2.8,23.4,0.0,5.468232,7.611178,E,31.0,SE,ENE,13.0,11.0,51.0,24.0,1024.6,1020.3,4.447461,4.50993,10.1,22.4,No
145456,3.6,25.3,0.0,5.468232,7.611178,NNW,22.0,SE,N,13.0,9.0,56.0,21.0,1023.5,1019.1,4.447461,4.50993,10.9,24.5,No
145457,5.4,26.9,0.0,5.468232,7.611178,N,37.0,SE,WNW,9.0,9.0,53.0,24.0,1021.0,1016.8,4.447461,4.50993,12.5,26.1,No


In [8]:
y

0         No
1         No
2         No
3         No
4         No
          ..
145454    No
145455    No
145456    No
145457    No
145458    No
Name: RainTomorrow, Length: 140787, dtype: object

In [9]:
x.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 140787 entries, 0 to 145458
Data columns (total 20 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   MinTemp        140787 non-null  float64
 1   MaxTemp        140787 non-null  float64
 2   Rainfall       140787 non-null  float64
 3   Evaporation    140787 non-null  float64
 4   Sunshine       140787 non-null  float64
 5   WindGustDir    140787 non-null  object 
 6   WindGustSpeed  140787 non-null  float64
 7   WindDir9am     140787 non-null  object 
 8   WindDir3pm     140787 non-null  object 
 9   WindSpeed9am   140787 non-null  float64
 10  WindSpeed3pm   140787 non-null  float64
 11  Humidity9am    140787 non-null  float64
 12  Humidity3pm    140787 non-null  float64
 13  Pressure9am    140787 non-null  float64
 14  Pressure3pm    140787 non-null  float64
 15  Cloud9am       140787 non-null  float64
 16  Cloud3pm       140787 non-null  float64
 17  Temp9am        140787 non-nul

In [11]:
# One-hot encode the object-typed (categorical) columns so that every feature is numeric
changed_x = pd.get_dummies(x)
changed_x

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,WindGustDir_E,WindGustDir_ENE,WindGustDir_ESE,WindGustDir_N,WindGustDir_NE,WindGustDir_NNE,WindGustDir_NNW,WindGustDir_NW,WindGustDir_S,WindGustDir_SE,WindGustDir_SSE,WindGustDir_SSW,WindGustDir_SW,WindGustDir_W,WindGustDir_WNW,WindGustDir_WSW,WindGustDir_missing,WindDir9am_E,WindDir9am_ENE,WindDir9am_ESE,WindDir9am_N,WindDir9am_NE,WindDir9am_NNE,WindDir9am_NNW,WindDir9am_NW,WindDir9am_S,WindDir9am_SE,WindDir9am_SSE,WindDir9am_SSW,WindDir9am_SW,WindDir9am_W,WindDir9am_WNW,WindDir9am_WSW,WindDir9am_missing,WindDir3pm_E,WindDir3pm_ENE,WindDir3pm_ESE,WindDir3pm_N,WindDir3pm_NE,WindDir3pm_NNE,WindDir3pm_NNW,WindDir3pm_NW,WindDir3pm_S,WindDir3pm_SE,WindDir3pm_SSE,WindDir3pm_SSW,WindDir3pm_SW,WindDir3pm_W,WindDir3pm_WNW,WindDir3pm_WSW,WindDir3pm_missing,RainToday_No,RainToday_Yes
0,13.4,22.9,0.6,5.468232,7.611178,44.0,20.0,24.0,71.0,22.0,1007.7,1007.1,8.000000,4.50993,16.9,21.8,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0
1,7.4,25.1,0.0,5.468232,7.611178,44.0,4.0,22.0,44.0,25.0,1010.6,1007.8,4.447461,4.50993,17.2,24.3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0
2,12.9,25.7,0.0,5.468232,7.611178,46.0,19.0,26.0,38.0,30.0,1007.6,1008.7,4.447461,2.00000,21.0,23.2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0
3,9.2,28.0,0.0,5.468232,7.611178,24.0,11.0,9.0,45.0,16.0,1017.6,1012.8,4.447461,4.50993,18.1,26.5,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
4,17.5,32.3,1.0,5.468232,7.611178,41.0,7.0,20.0,82.0,33.0,1010.8,1006.0,7.000000,8.00000,17.8,29.7,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145454,3.5,21.8,0.0,5.468232,7.611178,31.0,15.0,13.0,59.0,27.0,1024.7,1021.2,4.447461,4.50993,9.4,20.9,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
145455,2.8,23.4,0.0,5.468232,7.611178,31.0,13.0,11.0,51.0,24.0,1024.6,1020.3,4.447461,4.50993,10.1,22.4,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
145456,3.6,25.3,0.0,5.468232,7.611178,22.0,13.0,9.0,56.0,21.0,1023.5,1019.1,4.447461,4.50993,10.9,24.5,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
145457,5.4,26.9,0.0,5.468232,7.611178,37.0,9.0,9.0,53.0,24.0,1021.0,1016.8,4.447461,4.50993,12.5,26.1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0
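
> To make the effect of `pd.get_dummies` easier to see, here is a minimal sketch on a three-row invented frame. Numeric columns pass through unchanged, and each object column becomes one indicator column per category. The `dtype=int` argument is an addition for this example so the indicators print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy frame, invented for illustration; column names match the dataset.
toy = pd.DataFrame({
    "MinTemp": [13.4, 7.4, 12.9],
    "WindGustDir": ["W", "WNW", "missing"],
    "RainToday": ["No", "No", "Yes"],
})

# Object columns are expanded; MinTemp is left as it is.
encoded = pd.get_dummies(toy, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```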


In [12]:
# One-hot encode the target as well, giving two indicator columns (No, Yes)
changed_y = pd.get_dummies(y)
changed_y

Unnamed: 0,No,Yes
0,1,0
1,1,0
2,1,0
3,1,0
4,1,0
...,...,...
145454,1,0
145455,1,0
145456,1,0
145457,1,0


> Now we need to create the training set and the test set.

In [13]:
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test = train_test_split(changed_x,changed_y,test_size = 0.2)
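
> The split above is not seeded, so the rows in each set change from run to run. As a sketch of one way to make it repeatable, `train_test_split` accepts `random_state`, and `stratify` keeps the class proportions the same in both parts. The toy data and the helper name `split` are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and an imbalanced toy target (15 "No", 5 "Yes"), invented for illustration.
X = pd.DataFrame({"Humidity3pm": np.arange(20, dtype=float)})
y = pd.Series(["No"] * 15 + ["Yes"] * 5, name="RainTomorrow")

def split(seed):
    """Hypothetical helper: seeded, stratified 80/20 split."""
    return train_test_split(X, y, test_size=0.2, random_state=seed, stratify=y)

X_train, X_test, y_train, y_test = split(42)
X_train2, X_test2, _, _ = split(42)

# Same seed, same rows; the test set keeps the 3:1 class ratio.
print(X_test.index.equals(X_test2.index))
print(y_test.value_counts())
```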

In [14]:
x_train

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,WindGustDir_E,WindGustDir_ENE,WindGustDir_ESE,WindGustDir_N,WindGustDir_NE,WindGustDir_NNE,WindGustDir_NNW,WindGustDir_NW,WindGustDir_S,WindGustDir_SE,WindGustDir_SSE,WindGustDir_SSW,WindGustDir_SW,WindGustDir_W,WindGustDir_WNW,WindGustDir_WSW,WindGustDir_missing,WindDir9am_E,WindDir9am_ENE,WindDir9am_ESE,WindDir9am_N,WindDir9am_NE,WindDir9am_NNE,WindDir9am_NNW,WindDir9am_NW,WindDir9am_S,WindDir9am_SE,WindDir9am_SSE,WindDir9am_SSW,WindDir9am_SW,WindDir9am_W,WindDir9am_WNW,WindDir9am_WSW,WindDir9am_missing,WindDir3pm_E,WindDir3pm_ENE,WindDir3pm_ESE,WindDir3pm_N,WindDir3pm_NE,WindDir3pm_NNE,WindDir3pm_NNW,WindDir3pm_NW,WindDir3pm_S,WindDir3pm_SE,WindDir3pm_SSE,WindDir3pm_SSW,WindDir3pm_SW,WindDir3pm_W,WindDir3pm_WNW,WindDir3pm_WSW,WindDir3pm_missing,RainToday_No,RainToday_Yes
138650,16.1,34.2,0.0,25.000000,11.100000,33.00000,17.0,15.000000,21.0,13.000000,1014.00000,1009.600000,4.0,4.00000,25.4,32.50000,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
65723,3.2,15.6,0.0,2.000000,6.500000,65.00000,26.0,31.000000,79.0,50.000000,1018.90000,1016.600000,3.0,7.00000,8.9,14.40000,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
103520,2.4,16.6,0.0,2.200000,11.400000,30.00000,15.0,11.000000,59.0,21.000000,1025.30000,1021.300000,1.0,3.00000,10.0,16.20000,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
30675,9.4,15.6,0.0,3.400000,0.300000,40.03523,6.0,2.000000,74.0,61.000000,1014.50000,1011.900000,7.0,7.00000,11.0,14.40000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0
11702,1.7,16.8,0.0,5.468232,7.611178,26.00000,6.0,9.000000,76.0,45.000000,1021.50000,1017.300000,1.0,7.00000,9.2,15.30000,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33074,19.9,34.2,0.0,6.800000,10.600000,43.00000,7.0,26.000000,54.0,46.000000,1015.90000,1011.600000,1.0,1.00000,25.0,29.30000,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0
65473,6.6,16.4,0.0,3.200000,6.000000,70.00000,20.0,35.000000,78.0,46.000000,1009.50000,1006.200000,7.0,7.00000,10.3,15.40000,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0
62408,4.1,14.4,2.0,0.600000,8.400000,24.00000,9.0,9.000000,91.0,53.000000,1028.20000,1025.900000,3.0,3.00000,7.6,14.20000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1
9490,22.0,28.1,0.0,3.400000,9.100000,26.00000,9.0,9.000000,75.0,77.000000,1014.40000,1013.400000,4.0,5.00000,26.2,26.10000,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0


In [15]:
x_test

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,WindGustDir_E,WindGustDir_ENE,WindGustDir_ESE,WindGustDir_N,WindGustDir_NE,WindGustDir_NNE,WindGustDir_NNW,WindGustDir_NW,WindGustDir_S,WindGustDir_SE,WindGustDir_SSE,WindGustDir_SSW,WindGustDir_SW,WindGustDir_W,WindGustDir_WNW,WindGustDir_WSW,WindGustDir_missing,WindDir9am_E,WindDir9am_ENE,WindDir9am_ESE,WindDir9am_N,WindDir9am_NE,WindDir9am_NNE,WindDir9am_NNW,WindDir9am_NW,WindDir9am_S,WindDir9am_SE,WindDir9am_SSE,WindDir9am_SSW,WindDir9am_SW,WindDir9am_W,WindDir9am_WNW,WindDir9am_WSW,WindDir9am_missing,WindDir3pm_E,WindDir3pm_ENE,WindDir3pm_ESE,WindDir3pm_N,WindDir3pm_NE,WindDir3pm_NNE,WindDir3pm_NNW,WindDir3pm_NW,WindDir3pm_S,WindDir3pm_SE,WindDir3pm_SSE,WindDir3pm_SSW,WindDir3pm_SW,WindDir3pm_W,WindDir3pm_WNW,WindDir3pm_WSW,WindDir3pm_missing,RainToday_No,RainToday_Yes
70014,5.0,13.8,0.6,2.000000,4.500000,19.0,7.0,9.0,81.0,51.0,1037.00000,1035.700000,7.000000,4.00000,7.4,13.3,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0
42644,19.5,21.4,0.0,5.468232,7.611178,41.0,30.0,30.0,82.0,75.0,1019.20000,1019.900000,8.000000,8.00000,20.5,20.2,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0
136889,20.4,35.2,0.0,10.800000,7.200000,35.0,0.0,22.0,50.0,31.0,1006.10000,1003.100000,5.000000,7.00000,26.6,33.5,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
25254,16.2,18.8,11.4,5.468232,7.611178,26.0,15.0,11.0,99.0,99.0,1017.64994,1015.255889,4.447461,4.50993,16.4,17.4,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1
93963,22.8,29.6,17.4,6.400000,8.100000,39.0,22.0,26.0,76.0,68.0,1009.80000,1005.500000,4.000000,5.00000,26.6,28.1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25601,15.4,30.9,0.0,5.468232,7.611178,24.0,4.0,11.0,88.0,39.0,1017.64994,1015.255889,4.447461,4.50993,20.0,30.3,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0
122463,10.0,20.3,3.0,3.400000,9.500000,31.0,9.0,11.0,63.0,53.0,1022.00000,1020.000000,6.000000,2.00000,16.8,18.5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1
14511,23.1,35.2,0.6,13.800000,5.700000,41.0,20.0,9.0,70.0,34.0,1017.70000,1014.700000,7.000000,6.00000,23.7,31.4,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0
34901,12.9,25.2,0.0,6.000000,11.300000,61.0,17.0,39.0,47.0,41.0,1025.00000,1019.500000,4.000000,7.00000,20.0,23.4,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0


In [16]:
y_train

Unnamed: 0,No,Yes
138650,1,0
65723,1,0
103520,1,0
30675,0,1
11702,1,0
...,...,...
33074,1,0
65473,0,1
62408,1,0
9490,1,0


In [17]:
y_test

Unnamed: 0,No,Yes
70014,1,0
42644,1,0
136889,1,0
25254,0,1
93963,0,1
...,...,...
25601,1,0
122463,1,0
14511,1,0
34901,1,0


In [18]:
# Now we check which model gives the highest accuracy

from sklearn.linear_model import LogisticRegression 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier 
# Put the models into a dictionary (LogisticRegression is scored separately below)

models = {
    "KNN" : KNeighborsClassifier(),
    "random forest classifier" : RandomForestClassifier()
}

def fit_and_score(models,x_train,x_test,y_train,y_test):
  """
  Fits and evaluates the given machine learning models.
  models: a dict of different sklearn models
  x_train: training data (features)
  x_test: testing data (features)
  y_train: training labels
  y_test: testing labels
  """
  #set random seed
  np.random.seed(42)
  #make a dictionary to keep model scores 
  model_scores = {}

  #loop through models 
  for name, model in models.items():
    #fit the model to data 
    model.fit(x_train,y_train)
    #evaluate the model and append its score to model_scores
    model_scores[name] = model.score(x_test,y_test)
  return model_scores

In [19]:
model_scores = fit_and_score(models,
                             x_train,
                             x_test,
                             y_train,
                             y_test)
model_scores

{'KNN': 0.8349314582001562, 'random forest classifier': 0.8514099012713972}
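
> The two scores above come from a single train/test split and are fairly close. As a hedged sketch of a complementary check, `cross_val_score` averages accuracy over several splits. The example runs on synthetic data from `make_classification`, not on the weather data, and the small `n_estimators` value is chosen only to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the weather features and target.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "KNN": KNeighborsClassifier(),
    "random forest classifier": RandomForestClassifier(n_estimators=25, random_state=0),
}

# Mean accuracy over 5 folds for each model.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
print(cv_scores)
```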

In [20]:
x1_train,x1_test,y1_train,y1_test = train_test_split(changed_x,y,test_size = 0.2)

> LogisticRegression expects y_train to be a single column of class labels, so the target does not need one-hot encoding.
We therefore split the data again with the original y column and score logistic regression separately.
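
> The difference between the two target formats can be seen on a tiny invented example. One-hot encoding produces a two-column frame, while mapping Yes/No to 1/0 gives the single column that `LogisticRegression` accepts. The mapping is an illustration only; this notebook passes the original string labels instead, which also works.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy target and feature, invented for illustration.
y = pd.Series(["No", "No", "Yes", "No", "Yes", "Yes"], name="RainTomorrow")
X = pd.DataFrame({"Humidity3pm": [20.0, 30.0, 85.0, 25.0, 90.0, 80.0]})

one_hot = pd.get_dummies(y, dtype=int)   # two columns: No, Yes
labels = y.map({"No": 0, "Yes": 1})      # one column of 0/1 labels

# A one-dimensional label vector is what LogisticRegression.fit expects.
model = LogisticRegression().fit(X, labels)
print(one_hot.shape, labels.shape)
print(model.predict(X))
```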

In [21]:
np.random.seed(42)
model = LogisticRegression()
model.fit(x1_train,y1_train)
model.score(x1_test,y1_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.8382342495915903
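
> The warning above suggests raising `max_iter` or scaling the data. A minimal sketch of both, on synthetic data shifted to a pressure-like range rather than on the weather data: a pipeline standardises the features before fitting, and `max_iter=1000` is an arbitrary larger limit chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features, shifted and stretched so they are far from zero mean / unit variance.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X = X * 50.0 + 1000.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# The scaler is fitted on the training data only, then applied before the classifier.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

score = clf.score(X_test, y_test)
iterations = clf.named_steps["logisticregression"].n_iter_[0]
print(score, iterations)
```

> If `iterations` stays below `max_iter`, the solver stopped because it converged, not because it hit the limit.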

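Scores: 'KNN': 0.8345763193408623, 'random forest classifier': 0.8524398039633496, 'logistic regression': 0.8382342495915903 (the KNN and random forest values differ slightly from the output above, presumably because the train/test split is not seeded).

The random forest classifier scores highest, so we will go with it and try to tune it to get a better result.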

In [22]:
#create a hyperparameter grid for randomforest classifier
rf_grid = {"n_estimators" : np.arange(10,100,50),
           "max_depth" : [None,3,5,10],
           "min_samples_split" : np.arange(2,20,2),
           "min_samples_leaf" : np.arange(1,20,2)}

In [23]:
#Tune random forest classifier
from sklearn.model_selection import RandomizedSearchCV,GridSearchCV
np.random.seed(42)
rfc_tuning = RandomizedSearchCV(RandomForestClassifier(),
                                param_distributions = rf_grid,
                                cv = 5,
                                n_iter = 20,
                                verbose = True)

rfc_tuning.fit(x_train,y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:  8.4min finished
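
> The log reports 20 candidates because `n_iter = 20`, but `rf_grid` defines many more combinations than that. The sketch below copies the grid from this notebook and counts them. Note that `np.arange(10,100,50)` yields only two values for `n_estimators`.

```python
import math
import numpy as np

# Same grid as defined earlier in this notebook.
rf_grid = {"n_estimators": np.arange(10, 100, 50),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2)}

# Number of candidate values per hyperparameter and the size of the full grid.
sizes = {name: len(values) for name, values in rf_grid.items()}
total = math.prod(sizes.values())

print(rf_grid["n_estimators"].tolist())  # [10, 60]
print(sizes, total)                      # 2 * 4 * 9 * 10 = 720 combinations
```

> The randomized search therefore samples 20 of 720 possible settings.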


RandomizedSearchCV(cv=5, error_score=nan,
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    ccp_alpha=0.0,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    max_samples=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
               

In [25]:
rfc_tuning.score(x_test,y_test)

0.8512323318417502
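
> The tuned score (0.8512) is essentially the same as the untuned random forest score (0.8514). To see which settings the search actually picked, a fitted `RandomizedSearchCV` exposes `best_params_` and `best_score_`. The sketch below shows this on synthetic data with a deliberately small grid (`small_grid` is invented for illustration) so that it runs quickly. On this notebook's object the equivalent would be `rfc_tuning.best_params_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the weather features and target.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical small grid, chosen only to keep the example fast.
small_grid = {"n_estimators": [10, 20],
              "max_depth": [None, 3, 5],
              "min_samples_leaf": [1, 3, 5]}

search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions=small_grid,
                            n_iter=4, cv=3, random_state=0)
search.fit(X, y)

print(search.best_params_)  # the sampled setting with the best mean CV accuracy
print(search.best_score_)   # that mean cross-validated accuracy
```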