
In [168]:
from pandas import read_csv
from sklearn.model_selection import GridSearchCV,KFold, cross_val_score,train_test_split as split

from sklearn.linear_model import LogisticRegression as LGR
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.ensemble import GradientBoostingClassifier as GBC
from sklearn.naive_bayes import GaussianNB as GNB
from sklearn.decomposition import PCA
from sklearn.svm import SVC


from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

Read the dataset from GitHub and show the first 5 rows.

In [169]:
df = read_csv("https://raw.githubusercontent.com/Des282/Dataset/refs/heads/main/seattle-weather.csv") # load the CSV into a DataFrame
df.head()

Unnamed: 0,date,precipitation,temp_max,temp_min,wind,weather
0,2012-01-01,0.0,12.8,5.0,4.7,drizzle
1,2012-01-02,10.9,10.6,2.8,4.5,rain
2,2012-01-03,0.8,11.7,7.2,2.3,rain
3,2012-01-04,20.3,12.2,5.6,4.7,rain
4,2012-01-05,1.3,8.9,2.8,6.1,rain


Check whether any values are missing.

In [170]:
df.isna().sum()

Unnamed: 0,0
date,0
precipitation,0
temp_max,0
temp_min,0
wind,0
weather,0


Show the size of the dataset (rows, columns).

In [171]:
df.shape

(1461, 6)

In [172]:
df = df.drop(columns = ["date"])
X = df.drop(columns=["weather"])
y = df["weather"]

In [173]:
X.shape

(1461, 4)

Train and evaluate at least five different machine learning algorithms, such as:

• Linear Regression/Logistic Regression    
• Decision Trees     
• Random Forest    
• Support Vector Machines (SVM)    
• k-Nearest Neighbors (kNN)    
• Gradient Boosting Trees   
• Neural Networks   

In [174]:
X_train,X_test,y_train,y_test=split(X,y,test_size=0.25,random_state=25)

In [200]:
# Use spot-checking to quickly evaluate 7 classifiers with their default settings
models = {}
models['lgr'] = LGR() # accuracy (score() of a classifier is mean accuracy, not R2)
models['knn'] = KNN() # accuracy
models['dtc'] = DTC() # accuracy
models['rfc'] = RFC() # accuracy
models['gbc'] = GBC() # accuracy
models['gnb'] = GNB() # accuracy
models['svc'] = SVC() # accuracy

# before hyperparameter tuning
kf = KFold(n_splits=5,shuffle=True,random_state=42) # 5 shuffled folds, fixed seed for repeatability
for n in models:
  score=cross_val_score(models[n], X, y, cv=kf, n_jobs=-1)
  print(f'{n}:{score.mean():.3%},  {score.std():.3%}')

lgr:84.600%,  1.306%
knn:76.385%,  2.137%
dtc:76.799%,  3.273%
rfc:83.163%,  1.685%
gbc:82.821%,  1.729%
gnb:84.258%,  1.777%
svc:77.687%,  1.344%
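
The spot check above uses a plain `KFold`, which does not look at the labels when it splits. If the `weather` classes turn out to be imbalanced (this is not checked above), a stratified split keeps the class proportions roughly equal in every fold. The sketch below uses synthetic data from `make_classification` as a stand-in for the weather features; `X_demo`, `y_demo` and the class weights are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the 4 weather features (assumption: 3 imbalanced classes).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, weights=[0.7, 0.2, 0.1],
    random_state=42,
)

# StratifiedKFold preserves the class proportions of y_demo in each fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
strat_scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=skf)
print(f"gnb: {strat_scores.mean():.3%},  {strat_scores.std():.3%}")
```

In the notebook, passing `cv=skf` instead of `cv=kf` to `cross_val_score` would be the only change needed.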


In [201]:
Robust_scl = RobustScaler()#Feature Scaling with RobustScaler
Xs1 = Robust_scl.fit_transform(X)

MinMax_scl = MinMaxScaler()#Feature Scaling with MinMaxScaler
Xs2 = MinMax_scl.fit_transform(X)

Standard_scl = StandardScaler()#Feature Scaling with StandardScaler
Xs3 = Standard_scl.fit_transform(X)

In [202]:
print("Feature Scaling with")
for n in models:
    scores = cross_val_score(models[n], Xs1, y, cv=kf, n_jobs=-1) #get the accuracy
    print(f"ROBUST SCALING {n}: {scores.mean():.3%}, {scores.std():.3%}")#get the mean and standard deviation of accuracy
print(" ")
for i in models:
    scores = cross_val_score(models[i], Xs2, y, cv=kf, n_jobs=-1) #get the accuracy
    print(f"MINMAX SCALING {i}: {scores.mean():.3%}, {scores.std():.3%}")#get the mean and standard deviation of accuracy
print(" ")
for j in models:
    scores = cross_val_score(models[j], Xs3, y, cv=kf, n_jobs=-1) #get the accuracy
    print(f"STANDARD SCALING {j}: {scores.mean():.3%}, {scores.std():.3%}")#get the mean and standard deviation of accuracy


Feature Scaling with
ROBUST SCALING lgr: 81.931%, 1.370%
ROBUST SCALING knn: 78.097%, 1.249%
ROBUST SCALING dtc: 76.661%, 2.516%
ROBUST SCALING rfc: 83.027%, 1.538%
ROBUST SCALING gbc: 82.821%, 1.729%
ROBUST SCALING gnb: 84.258%, 1.777%
ROBUST SCALING svc: 78.371%, 1.300%
 
MINMAX SCALING lgr: 73.512%, 1.373%
MINMAX SCALING knn: 70.430%, 1.923%
MINMAX SCALING dtc: 76.182%, 3.116%
MINMAX SCALING rfc: 83.026%, 1.498%
MINMAX SCALING gbc: 82.821%, 1.729%
MINMAX SCALING gnb: 84.258%, 1.777%
MINMAX SCALING svc: 75.702%, 1.230%
 
STANDARD SCALING lgr: 78.371%, 0.894%
STANDARD SCALING knn: 72.415%, 1.430%
STANDARD SCALING dtc: 75.977%, 2.783%
STANDARD SCALING rfc: 82.958%, 1.653%
STANDARD SCALING gbc: 82.752%, 1.649%
STANDARD SCALING gnb: 84.258%, 1.777%
STANDARD SCALING svc: 77.551%, 1.173%
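
In the cells above each scaler is fitted on all of `X` before cross-validation, so the scaling statistics also see the rows that later serve as validation folds. A common way to avoid this is to put the scaler and the model in one pipeline, so the scaler is refitted on the training part of every fold. The following is a minimal sketch on synthetic data; `X_demo`, `y_demo` and `max_iter=1000` are assumptions, not values from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the weather features (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# The scaler is fitted inside each training fold only, then applied to the
# matching validation fold, so no validation statistics leak into training.
scaled_lgr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
pipe_scores = cross_val_score(scaled_lgr, X_demo, y_demo, cv=kf_demo)
print(f"STANDARD SCALING lgr: {pipe_scores.mean():.3%}, {pipe_scores.std():.3%}")
```

The same pattern works with `RobustScaler` or `MinMaxScaler` in place of `StandardScaler`.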


In [203]:
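# Feature addition: create new features based on the existing features
win_size = 4  # rolling window length in rows
df['mean_precipitation'] = df['precipitation'].rolling(win_size).mean()
# Operator precedence: this evaluates temp_max + (temp_min / 2), not the midpoint
# (temp_max + temp_min) / 2. The outputs recorded below come from this expression.
df['mean_temp'] = df['temp_max'] + df['temp_min'] / 2
df['mean_wind'] = df['wind'].rolling(win_size).mean()
# Unused alternatives:
# df['mean_temp_max'] = df['temp_max'].rolling(win_size).mean()
# df['mean_temp_min'] = df['temp_min'].rolling(win_size).mean()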

print(df.shape)
print(df.head(10))

(1458, 8)
    precipitation  temp_max  temp_min  wind weather  mean_precipitation  \
3            20.3      12.2       5.6   4.7    rain                 NaN   
4             1.3       8.9       2.8   6.1    rain                 NaN   
5             2.5       4.4       2.2   2.2    rain                 NaN   
6             0.0       7.2       2.8   2.3    rain               6.025   
7             0.0      10.0       2.8   2.0     sun               0.950   
8             4.3       9.4       5.0   3.4    rain               1.700   
9             1.0       6.1       0.6   3.4    rain               1.325   
10            0.0       6.1      -1.1   5.1     sun               1.325   
11            0.0       6.1      -1.7   1.9     sun               1.325   
12            0.0       5.0      -2.8   1.3     sun               0.250   

    mean_temp  mean_wind  
3       15.00        NaN  
4       10.30        NaN  
5        5.50        NaN  
6        8.60      3.825  
7       11.40      3.150  
8 

In [204]:
df.isna().sum()

Unnamed: 0,0
precipitation,0
temp_max,0
temp_min,0
wind,0
weather,0
mean_precipitation,3
mean_temp,0
mean_wind,3
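
The 3 missing values in each rolling column follow from the window length: a window of 4 rows has no complete window for the first 3 rows. The small worked example below reproduces this on the first five rows shown by `df.head()` earlier, and also contrasts the `mean_temp` expression as written in the notebook with the parenthesized midpoint. The frame `demo` and its column names `as_written` and `midpoint` are illustrative only.

```python
import pandas as pd

# First five rows of the dataset, copied from the df.head() output above.
demo = pd.DataFrame({
    "precipitation": [0.0, 10.9, 0.8, 20.3, 1.3],
    "temp_max": [12.8, 10.6, 11.7, 12.2, 8.9],
    "temp_min": [5.0, 2.8, 7.2, 5.6, 2.8],
})

# A window of 4 needs 4 rows, so the first 3 results are NaN.
demo["mean_precipitation"] = demo["precipitation"].rolling(4).mean()
n_missing = int(demo["mean_precipitation"].isna().sum())

# Division binds tighter than addition: temp_max + (temp_min / 2).
demo["as_written"] = demo["temp_max"] + demo["temp_min"] / 2
# Parentheses give the midpoint of the daily temperature range.
demo["midpoint"] = (demo["temp_max"] + demo["temp_min"]) / 2

print(demo)
print("rows with NaN:", n_missing)
```

For the first row, the expression as written gives 12.8 + 5.0 / 2 = 15.3, while the midpoint is (12.8 + 5.0) / 2 = 8.9.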


In [205]:
df = df.dropna() # drop the rows whose rolling-window features are NaN
df.head()

Unnamed: 0,precipitation,temp_max,temp_min,wind,weather,mean_precipitation,mean_temp,mean_wind
6,0.0,7.2,2.8,2.3,rain,6.025,8.6,3.825
7,0.0,10.0,2.8,2.0,sun,0.95,11.4,3.15
8,4.3,9.4,5.0,3.4,rain,1.7,11.9,2.475
9,1.0,6.1,0.6,3.4,rain,1.325,6.4,2.775
10,0.0,6.1,-1.1,5.1,sun,1.325,5.55,3.475


In [206]:
X_New = df.drop(columns=["weather"])
y_New = df["weather"]

Xs1 = Robust_scl.fit_transform(X_New)
Xs2 = MinMax_scl.fit_transform(X_New)
Xs3 = Standard_scl.fit_transform(X_New)

In [207]:
print(X_New.shape)
print(y_New.shape)

(1455, 7)
(1455,)


In [208]:
print("create new features based on the existing features")

for n in models:
    scores = cross_val_score(models[n], Xs1, y_New, cv=kf, n_jobs=-1) #get the accuracy
    print(f"ROBUST SCALING {n}: {scores.mean():.3%}, {scores.std():.3%}")#get the mean and standard deviation of accuracy
print(" ")
for i in models:
    scores = cross_val_score(models[i], Xs2, y_New, cv=kf, n_jobs=-1) #get the accuracy
    print(f"MINMAX SCALING {i}: {scores.mean():.3%}, {scores.std():.3%}")#get the mean and standard deviation of accuracy
print(" ")
for j in models:
    scores = cross_val_score(models[j], Xs3, y_New, cv=kf, n_jobs=-1) #get the accuracy
    print(f"STANDARD SCALING {j}: {scores.mean():.3%}, {scores.std():.3%}")#get the mean and standard deviation of accuracy

create new features based on the existing features
ROBUST SCALING lgr: 81.168%, 1.528%
ROBUST SCALING knn: 74.296%, 0.825%
ROBUST SCALING dtc: 76.701%, 3.115%
ROBUST SCALING rfc: 85.292%, 1.703%
ROBUST SCALING gbc: 84.055%, 2.347%
ROBUST SCALING gnb: 80.825%, 2.510%
ROBUST SCALING svc: 78.419%, 0.825%
 
MINMAX SCALING lgr: 73.127%, 2.167%
MINMAX SCALING knn: 70.859%, 1.329%
MINMAX SCALING dtc: 77.938%, 3.145%
MINMAX SCALING rfc: 85.636%, 1.824%
MINMAX SCALING gbc: 84.192%, 1.895%
MINMAX SCALING gnb: 80.756%, 2.430%
MINMAX SCALING svc: 74.708%, 1.546%
 
STANDARD SCALING lgr: 78.282%, 0.957%
STANDARD SCALING knn: 70.103%, 1.229%
STANDARD SCALING dtc: 77.526%, 2.512%
STANDARD SCALING rfc: 85.361%, 1.852%
STANDARD SCALING gbc: 84.261%, 2.178%
STANDARD SCALING gnb: 80.825%, 2.510%
STANDARD SCALING svc: 76.220%, 0.907%


In [209]:
# Baselines and one-parameter sweeps, each scored on the held-out test split
# Logistic regression: default settings (no tuning)
lgr= LGR().fit(X_train, y_train)
print(f'Accuracy of LGR: {lgr.score(X_test, y_test):.3%}') # score() returns mean accuracy, not R2
print(" ")

# Hyperparameter tuning for KNN: odd n_neighbors from 1 to 9
for i in range(1,11,2):
  knn= KNN(n_neighbors=i).fit(X_train, y_train)
  score=knn.score(X_test,y_test)
  print(f"Accuracy of KNN with neighbors={i}: {score:.3%}")
print(" ")

# Gaussian naive Bayes: default settings (no tuning)
gnb = GNB().fit(X_train, y_train)
print(f'Accuracy of GNB: {gnb.score(X_test, y_test):.3%}')
print(" ")

# Hyperparameter tuning for SVC: gamma from 0.0 to 1.0 in steps of 0.2
# (gamma=0 makes the RBF kernel constant, so that run is only a degenerate baseline)
for i in range(0,11,2):
  svc = SVC(kernel='rbf', C=1, gamma=i/10).fit(X_train, y_train)
  print(f"Accuracy of SVC with gamma={i/10}: {svc.score(X_test,y_test):.3%}")
print(" ")

# Hyperparameter tuning for DTC: odd max_depth from 1 to 9
for i in range(1,11,2):
  dtc = DTC(max_depth=i,random_state=42).fit(X_train, y_train)
  print(f'Accuracy of DTC with Max Depth={i}: {dtc.score(X_test, y_test):.2%}')
print(" ")

# Hyperparameter tuning for RFC: odd max_depth from 1 to 9
for i in range(1,11,2):
  rfc = RFC(max_depth=i, random_state=42).fit(X_train, y_train)
  print(f'Accuracy of RFC with Max Depth={i}: {rfc.score(X_test, y_test):.2%}')
print(" ")

# Hyperparameter tuning for GBC: odd max_depth from 1 to 9
for i in range(1,11,2):
  gbc= GBC(max_depth=i, random_state=42).fit(X_train, y_train)
  print(f'Accuracy of GBC with Max Depth={i}: {100 * gbc.score(X_test, y_test):.2f} %')
print(" ")


Accuracy of LGR: 83.333%
 
Accuracy of KNN with neighbors=1: 71.585%
Accuracy of KNN with neighbors=3: 74.317%
Accuracy of KNN with neighbors=5: 74.044%
Accuracy of KNN with neighbors=7: 76.230%
Accuracy of KNN with neighbors=9: 75.956%

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

 
Accuracy of GNB: 84.153%
 
Accuracy of SVC with gamma=0.0: 43.443%
Accuracy of SVC with gamma=0.2: 78.962%
Accuracy of SVC with gamma=0.4: 77.596%
Accuracy of SVC with gamma=0.6: 76.503%
Accuracy of SVC with gamma=0.8: 74.317%
Accuracy of SVC with gamma=1.0: 74.044%
 
Accuracy of DTC with Max Depth=1: 84.43%
Accuracy of DTC with Max Depth=3: 85.52%
Accuracy of DTC with Max Depth=5: 82.79%
Accuracy of DTC with Max Depth=7: 83.06%
Accuracy of DTC with Max Depth=9: 80.05%
 
Accuracy of RFC with Max Depth=1: 84.43%
Accuracy of RFC with Max Depth=3: 85.79%
Accuracy of RFC with Max Depth=5: 85.52%
Accuracy of RFC with Max Depth=7: 85.52%
Accuracy of RFC with Max Depth=9: 84.70%
 
Accuracy of GBC with Max Depth=1: 84.97 %
Accuracy of GBC with Max Depth=3: 82.79 %
Accuracy of GBC with Max Depth=5: 82.24 %
Accuracy of GBC with Max Depth=7: 81.15 %
Accuracy of GBC with Max Depth=9: 81.15 %
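
The sweeps above pick each setting by its score on `X_test`, so the test split is used both for choosing and for reporting. `GridSearchCV`, which is imported at the top of the notebook but not used, offers an alternative: it selects the setting by cross-validation on the training split only and leaves the test split for one final check. The sketch below shows the idea for the decision tree on synthetic data; `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, `y_te` are illustrative names, not the notebook's variables.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the weather features (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=1,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=25
)

# Same candidate depths as the manual loop: 1, 3, 5, 7, 9.
param_grid = {"max_depth": list(range(1, 11, 2))}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="accuracy",
)
search.fit(X_tr, y_tr)  # the test split is not touched here

print("best params:", search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.3%}")
print(f"held-out accuracy: {search.score(X_te, y_te):.3%}")
```

With the notebook's data, `X_tr` and `y_tr` would correspond to `X_train` and `y_train`, and the other models could be searched the same way with their own parameter grids.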
 


In [210]:
# After hyperparameter tuning: use the best settings from the sweeps above,
# evaluated on the robust-scaled data with the added features (Xs1, y_New)
models_new = {}
models_new['lgr'] = LGR() # accuracy (defaults, not tuned)
models_new['knn'] = KNN(n_neighbors=7) # accuracy
models_new['dtc'] = DTC(max_depth=3,random_state=42) # accuracy
models_new['rfc'] = RFC(max_depth=3,random_state=42) # accuracy
models_new['gbc'] = GBC(max_depth=1,random_state=42) # accuracy
models_new['gnb'] = GNB() # accuracy (defaults, not tuned)
models_new['svc'] = SVC(kernel='rbf', C=1, gamma=0.2) # accuracy
kf = KFold(n_splits=3,shuffle=True,random_state=42) # 3-fold cross-validation
for n in models_new:
  score=cross_val_score(models_new[n], Xs1, y_New, cv=kf, n_jobs=-1)
  print(f'{n}:{score.mean():.3%},  {score.std():.3%}')

lgr:80.962%,  1.218%
knn:76.564%,  1.095%
dtc:85.017%,  0.830%
rfc:85.017%,  1.121%
gbc:84.674%,  1.121%
gnb:80.825%,  1.347%
svc:78.694%,  0.591%


In [213]:

kf = KFold(n_splits=10,shuffle=True,random_state=42) # repeat with 10 folds
for n in models_new:
  score=cross_val_score(models_new[n], Xs1, y_New, cv=kf, n_jobs=-1)
  print(f'{n}:{score.mean():.3%},  {score.std():.3%}')

lgr:81.853%,  2.128%
knn:76.905%,  2.627%
dtc:84.806%,  2.547%
rfc:84.944%,  2.371%
gbc:84.944%,  2.448%
gnb:81.374%,  2.790%
svc:79.104%,  1.567%


In [214]:
kf = KFold(n_splits=5,shuffle=True,random_state=42) # repeat with 5 folds
for n in models_new:
  score=cross_val_score(models_new[n], Xs1, y_New, cv=kf, n_jobs=-1)
  print(f'{n}:{score.mean():.3%},  {score.std():.3%}')

lgr:81.168%,  1.528%
knn:75.670%,  1.095%
dtc:85.086%,  2.136%
rfc:84.948%,  2.335%
gbc:85.086%,  2.658%
gnb:80.825%,  2.510%
svc:79.038%,  0.896%
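
The three cells above repeat the same evaluation with 3, 5 and 10 folds. In the recorded outputs the mean accuracies stay close, while the standard deviation tends to grow with the number of folds, which is expected because each validation fold gets smaller. One way to make that comparison easier to read is to collect the results in a single loop, sketched here for one model on synthetic data; `X_demo`, `y_demo` and `summary` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled weather features (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=7,
)

summary = {}  # maps number of folds -> (mean accuracy, std of accuracy)
for k in (3, 5, 10):
    cv = KFold(n_splits=k, shuffle=True, random_state=42)
    fold_scores = cross_val_score(
        DecisionTreeClassifier(max_depth=3, random_state=42),
        X_demo, y_demo, cv=cv,
    )
    summary[k] = (fold_scores.mean(), fold_scores.std())
    print(f"dtc, {k:>2} folds: {summary[k][0]:.3%},  {summary[k][1]:.3%}")
```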
