<h2>CS 4780/5780 Final Project: </h2>
<h3>COVID-19 Hospitalizations Prediction for EU Countries</h3>

Names and NetIDs for your group members:

Emma Wang yw345
Mohamed Abdalla mja266
Nasredene Elyamani ne227

<h3>Introduction:</h3>

<p> The final project is about conducting a real-world machine learning project on your own, with everything that is involved. Unlike in programming projects 1-5, where we gave you all the scaffolding and you just filled in the blanks, you now start from scratch. The programming projects provide templates for how to do this, and the most recent video lectures summarize some of the tricks you will need (e.g. feature normalization, feature construction). So, this final project brings realism to how you will use machine learning in the real world. </p>

The task you will work on is predicting hospitalizations due to COVID-19. Although hospitalizations are directly related to COVID-19 cases, the different populations, timelines, and reactionary measures of different EU countries result in different trends in hospitalization numbers. In this project you will bring the power of machine learning to make predictions for country-level hospitalizations using COVID-19 age group case data and previous hospitalization data. There will be two tasks: the first is a basic problem that requires the methods learned in class, and the second is more difficult and requires some additional intuition and insight. <b>Please read the project description PDF file carefully and follow the instructions there. Also make sure you write your code and answers to all the questions in this Jupyter Notebook. </b> </p>
<p>


![europe-second-wave-covid-promo-1604686277132-superJumbo.png](attachment:europe-second-wave-covid-promo-1604686277132-superJumbo.png)

<h2>Part 1: Basics</h2><p>

<h3>1.1 Import:</h3><p>
Please import the packages you need. Note that learning and using packages is recommended but not required for this project. Official tutorials for the suggested packages include:
    
https://scikit-learn.org/stable/tutorial/basic/tutorial.html
    
https://pytorch.org/tutorials/
    
https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
<p>

In [None]:
import os
import pandas as pd
import numpy as np
import sklearn

<h3>1.2 Accuracy and Mean Squared Error:</h3><p>
To measure your performance in the Kaggle Competition, we are using accuracy and mean squared error (MSE). As a recap, accuracy is the percent of labels you predict correctly and MSE is the average squared difference between the estimated values and the actual value. To measure this, you can use library functions from sklearn. A simple example is shown below.
<p>

In [None]:
from sklearn.metrics import accuracy_score
y_pred = [3, 2, 1, 0, 1, 2, 3]
y_true = [0, 1, 2, 3, 1, 2, 3]
accuracy_score(y_true, y_pred)

0.42857142857142855

In [None]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y_true, y_pred)

2.857142857142857
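
As a sanity check, both metrics can also be computed directly with NumPy. This is a minimal sketch on the same toy lists as above; the names `manual_accuracy` and `manual_mse` are illustrative and not part of the project code.

```python
import numpy as np

y_pred = np.array([3, 2, 1, 0, 1, 2, 3])
y_true = np.array([0, 1, 2, 3, 1, 2, 3])

# Accuracy: fraction of positions where the prediction equals the label.
manual_accuracy = np.mean(y_pred == y_true)

# MSE: average of the squared differences between prediction and label.
manual_mse = np.mean((y_pred - y_true) ** 2)

print(manual_accuracy, manual_mse)
```

Three of the seven predictions match, and the squared errors sum to 20, which agrees with the sklearn results above.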

<h2>Part 2: Baseline Solution</h2><p>
Note that your code should be commented well and in part 2.4 you can refer to your comments.

<h3>2.1 Preprocessing and Feature Extraction:</h3><p>
Given the training dataset and graph information, you need to correctly preprocess the dataset (e.g. feature normalization). Think of what modifications can be done to the data to make it more easily interpretable.
<p>

In [None]:
#Reading in the data from csv file
df = pd.read_csv('/Users/nanboo/Desktop/CS 4780/finalproj/datasets/train_baseline.csv', sep=',',header=None, encoding='unicode_escape')

#making the first row the column name
row=df.iloc[0]
df.columns=row
df=df.drop([0])

#grabbing all the different countries
countries=np.unique(df["country"])

#This is where we standardize the data by country and then by columns
num_columns=["Daily hospital occupancy","under_15_cases","15-24_cases", "25-49_cases", "50-64_cases", "65-79_cases", "over_80_cases"]
country_list=[]
country_mean=[]
country_stddev=[]
for i in countries:
    #copy the slice so the assignment below does not trigger SettingWithCopyWarning
    country=df[df["country"]==i].copy()
    cases=country[num_columns].astype(float)
    #column-wise mean and population standard deviation (ddof=0, the same as np.std)
    mean=cases.mean()
    country_mean.append(mean)
    stddev=cases.std(ddof=0)
    country_stddev.append(stddev)
    country[num_columns]=(cases-mean)/stddev
    country_list.append(country)

#put it all back together into the data set
df=pd.concat(country_list)

#one hot encoding the countries
one_hot_country=pd.get_dummies(df.country, prefix='Code', drop_first = True)

#add the features to the feature vector
df=pd.concat([df, one_hot_country], axis=1)

#drop unnecessary features
df_dropped = df.drop(columns=["country", "date", "year_week"])

country_mean

df_dropped

df

Unnamed: 0,country,date,year_week,Daily hospital occupancy,under_15_cases,15-24_cases,25-49_cases,50-64_cases,65-79_cases,over_80_cases,...,Code_Estonia,Code_Iceland,Code_Ireland,Code_Italy,Code_Lithuania,Code_Netherlands,Code_Norway,Code_Portugal,Code_Slovenia,Code_Spain
1,Belgium,3/15/2020,2020-W11,-0.956441,-0.605323,-0.573126,-0.597310,-0.606753,-0.658021,-0.800440,...,0,0,0,0,0,0,0,0,0,0
2,Belgium,3/16/2020,2020-W12,-0.904203,-0.601047,-0.565149,-0.559340,-0.566673,-0.589336,-0.711627,...,0,0,0,0,0,0,0,0,0,0
3,Belgium,3/17/2020,2020-W12,-0.844155,-0.601047,-0.565149,-0.559340,-0.566673,-0.589336,-0.711627,...,0,0,0,0,0,0,0,0,0,0
4,Belgium,3/18/2020,2020-W12,-0.769460,-0.601047,-0.565149,-0.559340,-0.566673,-0.589336,-0.711627,...,0,0,0,0,0,0,0,0,0,0
5,Belgium,3/19/2020,2020-W12,-0.674748,-0.601047,-0.565149,-0.559340,-0.566673,-0.589336,-0.711627,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3981,Spain,2/19/2021,2021-W07,-0.125204,-0.894716,-1.226669,-1.179369,-1.134735,-1.129636,-1.262420,...,0,0,0,0,0,0,0,0,0,1
3982,Spain,2/22/2021,2021-W08,-0.280975,-1.147980,-1.374470,-1.331279,-1.258634,-1.262151,-1.430551,...,0,0,0,0,0,0,0,0,0,1
3983,Spain,2/24/2021,2021-W08,-0.522237,-1.147980,-1.374470,-1.331279,-1.258634,-1.262151,-1.430551,...,0,0,0,0,0,0,0,0,0,1
3984,Spain,2/25/2021,2021-W08,-0.615756,-1.147980,-1.374470,-1.331279,-1.258634,-1.262151,-1.430551,...,0,0,0,0,0,0,0,0,0,1


<h3>2.2 Use At Least Two Training Algorithms from class:</h3><p>
You need to use at least two training algorithms from class. You can use your code from previous projects or any packages you imported in part 1.1.

In [None]:
#reading in the test data
dftest = pd.read_csv('/Users/nanboo/Desktop/CS 4780/finalproj/datasets/test_baseline_no_label.csv', sep=',',header=None, encoding='unicode_escape')
row=dftest.iloc[0]
dftest.columns=row
dftest=dftest.drop([0])

#standardize the test data with the per-country statistics from the training set
num_columns=["Daily hospital occupancy","under_15_cases","15-24_cases", "25-49_cases", "50-64_cases", "65-79_cases", "over_80_cases"]
country_list=[]
for i, mean, stddev in zip(countries, country_mean, country_stddev):
    #copy the slice so the assignment below does not trigger SettingWithCopyWarning
    country=dftest[dftest["country"]==i].copy()
    cases=country[num_columns].astype(float)
    country[num_columns]=(cases-mean)/stddev
    country_list.append(country)

dftest=pd.concat(country_list)
#one hot encode the countries and add them to the feature vector
one_hot_country=pd.get_dummies(dftest.country, prefix='Code', drop_first = True)
dftest=pd.concat([dftest, one_hot_country], axis=1)
#drop unnecessary features
dftest_dropped=dftest.drop(columns=["country", "date", "year_week"])

#split the training data into features and labels
ylabels = df_dropped['next_week_increase_decrease']
df_dropped_nolabel = df_dropped.drop(columns=["next_week_increase_decrease"])

#SVM
#used the SVM learning method (explained in report)
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
dfnoyoungins = df_dropped_nolabel.drop(columns=["under_15_cases","15-24_cases"])
X = dfnoyoungins
y = ylabels
from sklearn.svm import SVC
clf = SVC(C=100,kernel='rbf')
clf.fit(X,y)
dfnoyounginstest = dftest_dropped.drop(columns=["under_15_cases","15-24_cases"])
predictionssvm = clf.predict(dfnoyounginstest)
paramsSVM = clf.get_params()


#randomforest
#used the random forest learning method (explained in report)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
X = df_dropped_nolabel
y = ylabels
clfRF = RandomForestClassifier(random_state=0)
clfRF.fit(X,y)
predictionsdeep = clfRF.predict(dftest_dropped)

#creating submission dataframe
dfsub = pd.DataFrame()
dfsub['country_id'] = dftest['country'] +' '+ dftest['date']
dfsub['next_week_increase_decrease'] = predictionssvm

predictionssvm
dftest_dropped
dfsub

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]


Unnamed: 0,country_id,next_week_increase_decrease
1,Belgium 9/7/2020,1
2,Belgium 9/8/2020,1
3,Belgium 9/9/2020,1
4,Belgium 9/10/2020,1
5,Belgium 9/11/2020,1
...,...,...
1140,Spain 3/5/2021,0
1141,Spain 3/8/2021,0
1142,Spain 3/9/2021,0
1143,Spain 3/10/2021,0


<h3>2.3 Training, Validation and Model Selection:</h3><p>
You need to split your data to a training set and validation set or performing a cross-validation for model selection.

In [None]:
#Start by creating short list of gammas and C values to fine tune predictions
from sklearn.model_selection import StratifiedShuffleSplit
percents = []
C_range = np.logspace(-2, 2, 3)
print(C_range)
gamma_range = np.logspace(-2, 1, 3)
print(gamma_range)
#started off by not dropping these features but then dropped (explained in report)
dfnoyoungins = df_dropped_nolabel.drop(columns=["under_15_cases","15-24_cases"])
X = dfnoyoungins
y = ylabels
#Used SSS to split data into 80% train and 20% 5 times and performed cross val
# to fine tune parameters for better predictions. (explained in report)
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
sss.get_n_splits(X, y)
for train_index, test_index in sss.split(X, y):
    for i in range(np.size(C_range)):
        for j in range(np.size(gamma_range)):
            X_train, X_test = X.iloc[train_index], X.iloc[test_index]
            y_train, y_test = y.iloc[train_index], y.iloc[test_index]

            #training and predictions
            clf = SVC(gamma = gamma_range[j],C=C_range[i],kernel='rbf')
            clf.fit(X_train,y_train)
            predictionssvm = clf.predict(X_test)

            #checking to see percentage correct
            ytestnp = y_test.to_numpy()
            totalsame = np.sum(ytestnp == predictionssvm)
            total = np.size(ytestnp)
            percent = (totalsame)/total
            percents.append(percent)

#finding maximum value and relating it to which gamma and C value it is
maximumper = max(percents)
maxindex = percents.index(maximumper)
print(maxindex)
perss = pd.DataFrame(percents)
perss

[1.e-02 1.e+00 1.e+02]
[ 0.01        0.31622777 10.        ]
TRAIN: [ 558  464 1515 ...  811 1474 2835] TEST: [3054 3089 1825 ... ]
TRAIN: [1840 3241 1025 ...   63   19 1315] TEST: [1323 1198  925 ... ]
TRAIN: [ 788 2569 1174 ...  829 3620  411] TEST: [  34 3195 1740 ... ]
TRAIN: [1301 3090 3732 ... 1336 1870 3525] TEST: [2281 2362 3056 ... ]
TRAIN: [1475 3805 3372 ... 2343 3468 2396] TEST: [1548 1106 3441 ... ]

Unnamed: 0,0
0,0.565872
1,0.568381
2,0.565872
3,0.691343
4,0.785445
5,0.853199
6,0.762861
7,0.85069
8,0.900878
9,0.565872
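
Because `percents` is a flat list filled in the order split, then C, then gamma, the position of its maximum has to be decoded to recover the corresponding C and gamma. The sketch below shows one way to do that with `np.unravel_index`, and also averages each (C, gamma) pair over the five splits, which is usually a steadier criterion than the single best run. The accuracies in it are randomly generated stand-ins, not our results.

```python
import numpy as np

C_range = np.logspace(-2, 2, 3)
gamma_range = np.logspace(-2, 1, 3)
n_splits = 5
grid_shape = (n_splits, len(C_range), len(gamma_range))

# Placeholder accuracies (random stand-ins, NOT our results), stored in the
# same order as the loops above: split, then C, then gamma.
rng = np.random.default_rng(0)
percents = rng.uniform(0.5, 0.9, size=int(np.prod(grid_shape)))

# Decode the flat index of the single best score.
split_idx, c_idx, g_idx = np.unravel_index(np.argmax(percents), grid_shape)
print("best single run:", split_idx, C_range[c_idx], gamma_range[g_idx])

# Average over the splits to get one score per (C, gamma) pair.
mean_scores = percents.reshape(grid_shape).mean(axis=0)
best_c, best_g = np.unravel_index(np.argmax(mean_scores), mean_scores.shape)
print("best on average:", C_range[best_c], gamma_range[best_g])
```

With this layout, flat index 8 in the table above decodes to split 0 with C = 100 and gamma = 10.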


<h3>2.4 Explanation in Words:</h3><p>
    You need to answer the following questions in the markdown cell after this cell:

2.4.1 How did you preprocess the dataset and features, and how did you formulate the learning problem (or problems)?

2.4.2 Which two learning methods from class did you choose and why did you make those choices?

2.4.3 How did you do the model selection?

2.4.4 Does the test performance reach a given baseline 70% performance? (Please include a screenshot of Kaggle Submission)

2.4.1 How did you preprocess the dataset and features, and how did you formulate the learning problem (or problems)?

To get the best results possible, we planned the preprocessing of our data carefully. We started by reading the training CSV file into a pandas data frame, where each column became a feature in our feature vector. We then standardized the training data by country and then by column: we grouped the rows by country and, for every column from Daily hospital occupancy through over_80_cases, subtracted that country's column mean and divided by that country's column standard deviation. We chose to standardize this way because each country has a different population, so raw counts are not directly comparable across countries. We then concatenated the per-country groups back together to form the training set. The test set was standardized with the same per-country means and standard deviations computed on the training set. The country itself was a very important piece of data, presumably because of the different resources and population of each country, so we turned the country name into a feature by one-hot encoding it. This added 14 binary features indicating which country a row belongs to. Finally, we dropped the date, country name, and week columns, as we found they were not useful here: the baseline task is a binary classification problem, not a regression problem.
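
The per-country standardization and one-hot encoding described above can also be sketched with `groupby`. This is a minimal illustration on made-up data, not the project code: the frames `train` and `test`, the single `cases` column, and the helper `standardize` are all hypothetical.

```python
import pandas as pd

# Toy stand-ins for the real data (values are made up for illustration).
train = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "cases": [10.0, 20.0, 30.0, 100.0, 300.0, 500.0],
})
test = pd.DataFrame({"country": ["A", "B"], "cases": [40.0, 700.0]})

# Per-country mean and population standard deviation (ddof=0, like np.std).
grouped = train.groupby("country")["cases"]
means = grouped.mean()
stds = grouped.std(ddof=0)

def standardize(frame, means, stds):
    """Z-score `cases` using the statistics of each row's country."""
    out = frame.copy()
    out["cases"] = (out["cases"] - out["country"].map(means)) / out["country"].map(stds)
    return out

train_std = standardize(train, means, stds)
test_std = standardize(test, means, stds)  # test reuses the TRAINING statistics

# One-hot encode the country; drop_first removes one redundant column.
one_hot = pd.get_dummies(train_std["country"], prefix="Code", drop_first=True)
train_std = pd.concat([train_std, one_hot], axis=1)
print(train_std)
```

Reusing the training statistics for the test rows keeps the two sets on the same scale without letting test data influence the preprocessing.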

2.4.2 Which two learning methods from class did you choose and why did you make those choices?

The two learning methods that we used were SVM and random forest. We chose SVM because we believed it would be a reliable classifier for a binary problem: the label is either zero for no increase in next week's hospitalizations or one for an increase. A soft-margin SVM also gives us more flexibility than a hard-margin SVM, since it tolerates some misclassified points, and should therefore generalize better. Because of the one-hot encoding of the countries, we also believed that an SVM decision boundary would be able to classify a majority of the labels correctly: one-hot encoding groups rows from the same country together, leading to a clustering effect. We used an RBF kernel because we believed our standardization produced a form of clustering around the per-country means. Initially, the default C and gamma values were used for learning. As for random forest, we believed that decision trees would be effective since we are dealing with a binary classification problem. The random forest classifier is an ensemble method that fits trees on sub-samples of the data and averages their predictions to improve accuracy and control over-fitting. Since we had already hit the baseline, we tweaked some of the random forest's parameters and submitted the results to Kaggle, such as the maximum number of features to consider when looking for the best split (max_features) and the maximum number of samples to draw for each tree (max_samples).
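
To make the two model choices concrete, here is a minimal sketch that fits both classifiers on synthetic data. The data comes from `make_classification` rather than the project data set, and the `max_features` and `max_samples` values are illustrative assumptions, not the settings we submitted.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary data standing in for the standardized features.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Soft-margin SVM with an RBF kernel; C sets the penalty on margin violations.
svm = SVC(C=100, kernel="rbf").fit(X_train, y_train)

# Random forest; max_features and max_samples are illustrative values only.
forest = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", max_samples=0.8, random_state=0
).fit(X_train, y_train)

print("SVM validation accuracy:", svm.score(X_val, y_val))
print("Forest validation accuracy:", forest.score(X_val, y_val))
```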


2.4.3 How did you do the model selection?

We first decided to use an SVM. For model selection, we built lists of candidate gamma and C values and looked for the combination that best predicted the validation set after training on the training set. The C values ranged from 0.01 to 100 so that the search covered both strongly regularized models (which risk underfitting) and weakly regularized ones (which risk overfitting). The gamma range was chosen similarly, although we noticed that the best gamma varied between the training, validation, and test sets. To split the data evenly, we used a Stratified Shuffle Split, stratified by country, which split the entire data set into 80% training and 20% validation (test) five separate times. A double for loop over each training set evaluated every combination of gamma and C. At the end of this cross-validation we printed the list of accuracies, which showed that C = 100 and gamma = 10 came out on top. We plugged these values into our model and achieved 68% accuracy on the test set.

We were confused about why we were not hitting the baseline, since we had achieved almost 91% accuracy on the validation set. We realized that we were overfitting because of the gamma parameter. We decided to keep C = 100, since it gave consistently better results in our cross-validation, and to use the `gamma='scale'` option provided by scikit-learn. This option refines the definition of gamma we had learned in the course: rather than 1 / (number of features), it uses 1 / (number of features * variance of the training features). With this change we achieved 74% accuracy on Kaggle. This suggested that a hand-tuned gamma did not transfer well, since the best value changed from set to set, as noted earlier.

To improve our accuracy further, we did a form of backward feature selection, removing features we judged unnecessary. Our intuition led us to start by removing the under-15 feature, and cross-validation confirmed that this helped. We then removed the 15-24 age range, which helped even more. Our reasoning for pruning these features was that these age ranges do little to represent hospital occupancy, since people in these age ranges often do not need to go to the hospital when they contract COVID. After submitting this model to Kaggle, we achieved 79% accuracy on the test set.
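The search described above can be sketched roughly as follows. This is a minimal illustration, not our exact notebook code: the data comes from `make_classification` as a stand-in for the standardized baseline features, the candidate grids are assumptions, and the split is stratified by class label here rather than by country.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC

# Synthetic stand-in for the standardized baseline features (assumption).
X, y = make_classification(n_samples=300, n_features=7, random_state=0)

# Candidate values (illustrative grids, not the exact lists we used).
C_values = [0.01, 0.1, 1, 10, 100]
gamma_values = [0.01, 0.1, 1, 10]

# Five random 80/20 splits that keep the label proportions in every split.
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

scores = {}
for C in C_values:
    for gamma in gamma_values:
        fold_accuracy = []
        for train_idx, val_idx in splitter.split(X, y):
            clf = SVC(kernel="rbf", C=C, gamma=gamma)
            clf.fit(X[train_idx], y[train_idx])
            fold_accuracy.append(clf.score(X[val_idx], y[val_idx]))
        # Mean validation accuracy over the five splits.
        scores[(C, gamma)] = float(np.mean(fold_accuracy))

best_C, best_gamma = max(scores, key=scores.get)
print("best (C, gamma):", (best_C, best_gamma), "accuracy:", scores[(best_C, best_gamma)])

# The value scikit-learn uses for gamma="scale": 1 / (n_features * X.var()).
gamma_scale = 1.0 / (X.shape[1] * X.var())
print("gamma='scale' on this data:", gamma_scale)
```

Because `gamma='scale'` is recomputed from whatever data the model is fitted on, it adapts to the spread of the features instead of relying on a single hand-picked value.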




2.4.4 Does the test performance reach a given baseline 70% performance? (Please include a screenshot of Kaggle Submission)

Yes!
![Screen%20Shot%202021-05-14%20at%206.46.45%20PM.png](attachment:Screen%20Shot%202021-05-14%20at%206.46.45%20PM.png)


<h2>Part 3: Creative Solution</h2><p>

<h3>3.1 Open-ended Code:</h3><p>
You may follow the steps in part 2 again but making innovative changes like creating/using new features, using new training algorithms, etc. Make sure you explain everything clearly in part 3.2. Note that reaching the 150k MSE creative baseline is only a small portion of this part. Any creative ideas will receive most points as long as they are reasonable and clearly explained.

In [None]:
#reading in the training creative data set
df = pd.read_csv('/Users/nanboo/Desktop/CS 4780/finalproj/datasets/train_creative.csv', sep=',',header=None, encoding='unicode_escape')

#Make the first row the column labels
row=df.iloc[0]
df.columns=row
df=df.drop([0])



#extract the unique countries
countries=np.unique(df["country"])

#standardize each column in num_columns separately for every country
num_columns=["Daily hospital occupancy","under_15_cases","15-24_cases", "25-49_cases", "50-64_cases", "65-79_cases", "over_80_cases"]
country_list=[]
country_mean=[]
country_stddev=[]
for i in countries:
    #copy the slice so the assignment below does not raise SettingWithCopyWarning
    country=df[df["country"]==i].copy()
    cases=country[num_columns].astype(float)
    #column-wise mean and population standard deviation (ddof=0, as np.std uses)
    mean=cases.mean()
    country_mean.append(mean)
    stddev=cases.std(ddof=0)
    country_stddev.append(stddev)
    country[num_columns]=(cases-mean)/stddev
    country_list.append(country)



df=pd.concat(country_list)

#encode the dates linearly starting from Jan 1st 2020
df["date"]=pd.to_datetime(df["date"])

year2020 = df[df["date"].dt.year == 2020]
year2021 = df[df["date"].dt.year == 2021]

year2020 = year2020["date"].dt.dayofyear
year2021 = year2021["date"].dt.dayofyear

year2020 = year2020.astype(int)
year2021 = year2021.astype(int)
year2021 = year2021 + 366

alldates = pd.concat([year2020,year2021])
alldates = alldates.sort_index()
df["date"] = alldates

#drop unnecessary features
df_dropped_creative = df.drop(columns=["year_week"])
df_dropped_creative = df_dropped_creative.sort_index()

country_mean

df_dropped_creative



Unnamed: 0,country,date,Daily hospital occupancy,under_15_cases,15-24_cases,25-49_cases,50-64_cases,65-79_cases,over_80_cases,next_week_hospitalizations
1,Belgium,75,-0.956441,-0.605323,-0.573126,-0.597310,-0.606753,-0.658021,-0.800440,1661
2,Belgium,76,-0.904203,-0.601047,-0.565149,-0.559340,-0.566673,-0.589336,-0.711627,1883
3,Belgium,77,-0.844155,-0.601047,-0.565149,-0.559340,-0.566673,-0.589336,-0.711627,2204
4,Belgium,78,-0.769460,-0.601047,-0.565149,-0.559340,-0.566673,-0.589336,-0.711627,2721
5,Belgium,79,-0.674748,-0.601047,-0.565149,-0.559340,-0.566673,-0.589336,-0.711627,3111
...,...,...,...,...,...,...,...,...,...,...
3981,Spain,416,-0.125204,-0.894716,-1.226669,-1.179369,-1.134735,-1.129636,-1.262420,10200
3982,Spain,419,-0.280975,-1.147980,-1.374470,-1.331279,-1.258634,-1.262151,-1.430551,9896
3983,Spain,421,-0.522237,-1.147980,-1.374470,-1.331279,-1.258634,-1.262151,-1.430551,9761
3984,Spain,422,-0.615756,-1.147980,-1.374470,-1.331279,-1.258634,-1.262151,-1.430551,9381
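The per-country standardization in the cell above can also be sketched with `groupby`, which avoids the positional `country_mean[j]` bookkeeping by looking statistics up by country name. The frames and values below are made up for illustration, and the helper `standardize` is a hypothetical name, not part of our notebook.

```python
import numpy as np
import pandas as pd

# Tiny made-up frames with the same kind of columns as the creative data set.
train = pd.DataFrame({
    "country": ["Belgium", "Belgium", "Belgium", "Spain", "Spain"],
    "over_80_cases": [10.0, 20.0, 30.0, 100.0, 300.0],
})
test = pd.DataFrame({
    "country": ["Spain", "Belgium"],
    "over_80_cases": [400.0, 20.0],
})
num_cols = ["over_80_cases"]

# Per-country statistics computed from the training rows only.
by_country = train.groupby("country")[num_cols]
mean_by_country = by_country.mean()
std_by_country = by_country.std(ddof=0)  # population std, matching np.std


def standardize(frame):
    """Scale each row with the training statistics of that row's country."""
    out = frame.copy()
    mean = mean_by_country.loc[frame["country"]].to_numpy()
    std = std_by_country.loc[frame["country"]].to_numpy()
    out[num_cols] = (frame[num_cols].to_numpy() - mean) / std
    return out


train_scaled = standardize(train)
test_scaled = standardize(test)
print(test_scaled)
```

Looking the statistics up by name means the result does not depend on the test rows arriving in the same country order as the training rows.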


In [None]:
#reading in the testing creative data set
dftest_creative = pd.read_csv('/Users/nanboo/Desktop/CS 4780/finalproj/datasets/test_creative_no_label.csv', sep=',',header=None, encoding='unicode_escape')
#Make the first row the column labels
row=dftest_creative.iloc[0]
dftest_creative.columns=row
dftest_creative=dftest_creative.drop([0])

#standardize each column in num_columns per country,
#using the mean and standard deviation computed on the training set
num_columns=["Daily hospital occupancy","under_15_cases","15-24_cases", "25-49_cases", "50-64_cases", "65-79_cases", "over_80_cases"]
country_list=[]
for j, i in enumerate(countries):
    #copy the slice so the assignment below does not raise SettingWithCopyWarning
    country=dftest_creative[dftest_creative["country"]==i].copy()
    cases=country[num_columns].astype(float)
    country[num_columns]=(cases-country_mean[j])/country_stddev[j]
    country_list.append(country)

dftest_creative=pd.concat(country_list)

#create submission data frame
dfsub_creative = pd.DataFrame()
dfsub_creative['country_id'] = dftest_creative['country'] +' '+ dftest_creative['date']

#encode the dates linearly starting from Jan 1st 2020
dftest_creative["date"]=pd.to_datetime(dftest_creative["date"])

year2020 = dftest_creative[dftest_creative["date"].dt.year == 2020]
year2021 = dftest_creative[dftest_creative["date"].dt.year == 2021]

year2020 = year2020["date"].dt.dayofyear
year2021 = year2021["date"].dt.dayofyear

year2020 = year2020.astype(int)
year2021 = year2021.astype(int)
year2021 = year2021 + 366

alldates = pd.concat([year2020,year2021])
alldates = alldates.sort_index()
dftest_creative["date"] = alldates

#drop unnecessary features from data set
dftest_creative = dftest_creative.sort_index()
dftest_dropped_crea=dftest_creative.drop(columns=["year_week"])

from sklearn.linear_model import LinearRegression

allpredictions = []

#train one linear regression per country (our "ensemble") and predict that country's test rows
for i in countries:
    countrytest = dftest_dropped_crea[dftest_dropped_crea["country"]==i]
    countrytrain = df_dropped_creative[df_dropped_creative["country"]==i]
    y =  countrytrain['next_week_hospitalizations'].astype(float)
    # first used dates, then realized negative effect and dropped it (explained in report)
    X = countrytrain.drop(columns=["next_week_hospitalizations","country","date"])
    reg = LinearRegression().fit(X, y)
    countrytest = countrytest.drop(columns=["country","date"])
    predictions = reg.predict(countrytest)
    allpredictions.append(predictions)

#get all predictions into one place
predscolumn = np.concatenate(allpredictions)

dfsub_creative['next_week_hospitalizations'] = predscolumn
dfsub_creative



Unnamed: 0,country_id,next_week_hospitalizations
1,Belgium 9/7/2020,695.560663
2,Belgium 9/8/2020,718.967479
3,Belgium 9/9/2020,709.604752
4,Belgium 9/10/2020,708.668480
5,Belgium 9/11/2020,713.349843
...,...,...
1140,Spain 3/5/2021,7707.013396
1141,Spain 3/8/2021,7433.394316
1142,Spain 3/9/2021,7305.232717
1143,Spain 3/10/2021,7215.856865
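As a rough sketch of the per-country training loop above, the version below fits one `LinearRegression` per country on synthetic data and writes each prediction back by row label, so the output order follows the test frame regardless of how the countries are ordered. The data generator `make_country`, the single `occupancy` feature, and the slopes are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)


def make_country(name, slope, n):
    """Synthetic rows: the target is a noisy linear function of one feature."""
    x = rng.normal(size=n)
    target = slope * x + 1000 + rng.normal(scale=0.1, size=n)
    return pd.DataFrame({"country": name, "occupancy": x,
                         "next_week_hospitalizations": target})


train = pd.concat([make_country("Belgium", 500, 40),
                   make_country("Spain", 3000, 40)], ignore_index=True)
# The test rows deliberately come in a different country order.
test = pd.concat([make_country("Spain", 3000, 5),
                  make_country("Belgium", 500, 5)], ignore_index=True)

feature_cols = ["occupancy"]
predictions = pd.Series(np.nan, index=test.index)
models = {}
for country, train_rows in train.groupby("country"):
    model = LinearRegression().fit(train_rows[feature_cols],
                                   train_rows["next_week_hospitalizations"])
    models[country] = model
    # Fill in only the test rows that belong to this country.
    mask = test["country"] == country
    predictions.loc[mask] = model.predict(test.loc[mask, feature_cols])

print(predictions)
```

Each country gets its own slope and intercept, which is the property we were after when we split the training by country.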


In [None]:
#CROSS VALIDATION FOR OTHER LEARNING METHODS DESCRIBED IN THE REPORT
# ******************VALIDATION TESTING************************************
# Similar approach to training and testing except with 70% of training to train
# and 30% of training to test

dfnew = pd.read_csv('/Users/nanboo/Desktop/CS 4780/finalproj/datasets/train_creative.csv', sep=',',header=None, encoding='unicode_escape')

# Making first row column row
row=dfnew.iloc[0]
dfnew.columns=row
dfnew=dfnew.drop([0])

# Extract the unique countries
countries=np.unique(dfnew["country"])

# Split the training set, dfnew, into randomized 70% and 30% sets
dfnew70 = dfnew.sample(frac=0.70)
dfnew30 = dfnew.drop(dfnew70.index)


#TRAINING DATA: standardize each column in num_columns per country (date encoding follows below)
num_columns=["Daily hospital occupancy","under_15_cases","15-24_cases", "25-49_cases", "50-64_cases", "65-79_cases", "over_80_cases"]
country_list=[]
country_mean=[]
country_stddev=[]
for i in countries:
    #copy the slice so the assignment below does not raise SettingWithCopyWarning
    country=dfnew70[dfnew70["country"]==i].copy()
    cases=country[num_columns].astype(float)
    #column-wise mean and population standard deviation (ddof=0, as np.std uses)
    mean=cases.mean()
    country_mean.append(mean)
    stddev=cases.std(ddof=0)
    country_stddev.append(stddev)
    country[num_columns]=(cases-mean)/stddev
    country_list.append(country)

dfnew70 = pd.concat(country_list)

#Date Encoding
dfnew70["date"]=pd.to_datetime(dfnew70["date"])

newyear2020 = dfnew70[dfnew70["date"].dt.year == 2020]
newyear2021 = dfnew70[dfnew70["date"].dt.year == 2021]

newyear2020 = newyear2020["date"].dt.dayofyear
newyear2021 = newyear2021["date"].dt.dayofyear

newyear2020 = newyear2020.astype(int)
newyear2021 = newyear2021.astype(int)
newyear2021 = newyear2021 + 366

newalldates = pd.concat([newyear2020,newyear2021])
newalldates = newalldates.sort_index()
dfnew70["date"] = newalldates

dfnew70 = dfnew70.drop(columns=["year_week"])
dfnew70 = dfnew70.sort_index()



#TESTING DATA: standardize each column in num_columns per country with the
#mean and standard deviation from the 70% training split (date encoding follows below)
num_columns=["Daily hospital occupancy","under_15_cases","15-24_cases", "25-49_cases", "50-64_cases", "65-79_cases", "over_80_cases"]
country_list=[]
for j, i in enumerate(countries):
    #copy the slice so the assignment below does not raise SettingWithCopyWarning
    country=dfnew30[dfnew30["country"]==i].copy()
    cases=country[num_columns].astype(float)
    country[num_columns]=(cases-country_mean[j])/country_stddev[j]
    country_list.append(country)

dfnew30=pd.concat(country_list)

df_final = pd.DataFrame()
df_final['country_id'] = dfnew30['country'] +' '+ dfnew30['date']

#Date Encoding
dfnew30["date"]=pd.to_datetime(dfnew30["date"])

year2020 = dfnew30[dfnew30["date"].dt.year == 2020]
year2021 = dfnew30[dfnew30["date"].dt.year == 2021]

year2020 = year2020["date"].dt.dayofyear
year2021 = year2021["date"].dt.dayofyear

year2020 = year2020.astype(int)
year2021 = year2021.astype(int)
year2021 = year2021 + 366

alldates = pd.concat([year2020,year2021])
alldates = alldates.sort_index()
dfnew30["date"] = alldates

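dfnew30 = dfnew30.sort_index()
dfnew30 = dfnew30.drop(columns=["year_week"])
#true targets for the validation rows, in index order; this lines up with the
#per-country predictions below only if the file is grouped by country in the same order as `countries`
msetrue = dfnew30["next_week_hospitalizations"].astype(float).to_numpy()
dfnew30 = dfnew30.drop(columns=["next_week_hospitalizations"])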

# ENSEMBLE TRAINING AND TESTING WITH RIDGE REGRESSION MODEL
#
# Of the models with hyperparameters that we tested, Ridge Regression
# performed best
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Ridge

# try each regularization strength and print the validation MSE
for alpha in [0.1,0.5,1.0,10,100,300]:
    allpredictions = []
    for i in countries:
        countrytest = dfnew30[dfnew30["country"]==i]
        countrytrain = dfnew70[dfnew70["country"]==i]
        y =  countrytrain['next_week_hospitalizations'].astype(float)
        X = countrytrain.drop(columns=["next_week_hospitalizations","country"])
        clf = Ridge(alpha=alpha,solver='auto').fit(X,y)
        countrytest = countrytest.drop(columns=["country"])
        allpredictions.append(clf.predict(countrytest))

    msepredict = np.concatenate(allpredictions)
    print(alpha)
    print(mean_squared_error(msetrue,msepredict))

# ************************END OF VALIDATION




0.1
212871.0572457162
0.5
223036.11532726808
1.0
231030.6129314636
10
303957.4367156334
100
1069601.4117406877
300
2865099.401665342
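To make the printed results easier to read, the alpha values and validation MSEs from the run above can be collected and compared directly. The numbers below are copied from that output; the variable names are only for illustration.

```python
# Validation MSE for each Ridge alpha, copied from the output above.
validation_mse = {
    0.1: 212871.0572457162,
    0.5: 223036.11532726808,
    1.0: 231030.6129314636,
    10: 303957.4367156334,
    100: 1069601.4117406877,
    300: 2865099.401665342,
}

# The alpha with the lowest validation MSE.
best_alpha = min(validation_mse, key=validation_mse.get)
print("best alpha:", best_alpha, "MSE:", validation_mse[best_alpha])

# MSE values listed in order of increasing alpha.
mses_by_alpha = [validation_mse[a] for a in sorted(validation_mse)]
print("MSE grows with alpha:", mses_by_alpha == sorted(mses_by_alpha))
```

In this run the error rises steadily as alpha increases and the best value sits at the smallest alpha in the grid, so even smaller values might be worth trying; that is a suggestion, not something this run tested.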


<h3>3.2 Explanation in Words:</h3><p>

You need to answer the following questions in a markdown cell after this cell:

3.2.1 How much did you manage to improve performance on the test set? Did you reach the 150k MSE for the test in Kaggle? (Please include a screenshot of Kaggle Submission)

3.2.2 Please explain in detail how you achieved this and what you did specifically and why you tried this.

3.2.1 How much did you manage to improve performance on the test set? Did you reach the 150k MSE for the test in Kaggle? (Please include a screenshot of Kaggle Submission)

Our very first attempt used Random Forest, which gave a mean-squared error of 5.9 million on Kaggle. We managed to improve our performance on the test set all the way down to a mean-squared error of 147k, reaching the 150k MSE goal.

![Screen%20Shot%202021-05-14%20at%206.43.29%20PM.png](attachment:Screen%20Shot%202021-05-14%20at%206.43.29%20PM.png)

3.2.2 Please explain in detail how you achieved this and what you did specifically and why you tried this.
Creative ideas:

Initially, we took one of our best models from the baseline portion of the project, Random Forest, just to see where we stood and how to proceed from there. Our initial score was over 5 million, and we realized we would need a sounder predictor for next_week_hospitalizations. Without any modifications to our preprocessing, we fed our training and testing data as is into a Gradient Boosting Regressor. Our score improved to 890k, which was progress, but we realized we would need to change how we trained our model.

We researched how to improve the models we had already tested through Kaggle (Gradient Boosting, Random Forest, and Ridge Regression) by tuning their hyperparameters, and learned that we could use cross-validation. We implemented this by taking a random 70% of the training data as our new training set and holding out the remaining 30% as a validation set. We then preprocessed the data as in our baseline: we standardized the features using the mean and standard deviation of the 70% training set, and used that same mean and standard deviation to standardize the 30% validation set. By wrapping our ensemble method in a for-loop, we passed in multiple hyperparameter values (alpha for Ridge/Lasso, C for SVM/SVR, etc.) and checked how each value affected performance on the validation set, which also let us compare different models.

Our intuition was that each country would be affected by COVID differently and would have different approaches, resources, etc., so we decided to use what we call an ensemble approach: we got the list of countries and trained a separate model on each country's data to better capture its next_week_hospitalizations. Our Gradient Boosting model performed remarkably well on the validation set, averaging a mean-squared error of around 75K. We were excited and thought we would surely beat the 150k MSE goal, but we got an MSE of 450k on Kaggle, which meant our Gradient Boosting model was badly overfitting.

We then tried Linear Regression for our per-country models because, based on the provided graph, we assumed the data was roughly a piecewise linear function (or that the overall trend was upward), and we verified its better performance on our validation set. With this model we got our MSE down to 200k, but we were determined to take it a step further. Given how well Linear Regression performed, we decided to try Ridge Regression and Lasso Regression. Ridge Regression performed well on our validation set, and we tuned the alpha hyperparameter, which landed us a 268k MSE. For Lasso Regression, we ran GridSearch over two different alpha values, fitted our training set with the resulting model, and generated predictions on the test set. It did not perform as well as Linear Regression, probably because we did not pick very good hyperparameters. If we had more time, this is something we would want to spend more time tinkering with and researching.
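A wider Lasso search along the lines we would have liked to try could look roughly like the sketch below. It is not what we ran: the data comes from `make_regression`, and the log-spaced alpha grid, the `max_iter` value, and the 5-fold setting are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for one country's standardized features.
X, y = make_regression(n_samples=200, n_features=7, noise=10.0, random_state=0)

# Log-spaced alphas from 0.001 to 100 instead of only two hand-picked values.
param_grid = {"alpha": np.logspace(-3, 2, 6)}

search = GridSearchCV(
    Lasso(max_iter=10000),
    param_grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximizes, so MSE is negated
    cv=5,
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
best_mse = -search.best_score_  # flip the sign back to a plain MSE
print("best alpha:", best_alpha, "cross-validated MSE:", best_mse)
```

Spacing the candidates on a log scale covers several orders of magnitude with few fits, which is usually a reasonable first pass before narrowing the range.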

We also wanted to try non-linear approaches, so we tried SVR with an 'rbf' kernel, since it is a widely used kernel, and MLP Regressor, a neural network. However, when we fitted the MLP Regressor on the training data and then generated predictions on that same training data, the MSE was very poor. We guessed that, because neural networks can represent more complex models, the added complexity caused overfitting, although a poor fit on the training data itself is also consistent with the network simply failing to fit the data well. We chose SVR based on the success of the SVM in the baseline approach: SVR let us keep the machinery of an SVM while applying it to a regression problem. However, after testing multiple C values, the SVR proved to overgeneralize the data and gave the same predictions for dates in the same week.
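One way to sanity-check a diagnosis like this is to compare the error on the training rows with the error on held-out rows. The sketch below uses NumPy polynomial fits on made-up data rather than an MLP or SVR, purely to show the two patterns: a model that is too simple has a high error on both sets, whereas overfitting would show a low training error next to a much higher held-out error.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 60)
y = 3.0 * x**2 + rng.normal(scale=0.1, size=x.size)  # noisy quadratic (made up)

# Alternate points between a training set and a held-out validation set.
train = np.arange(0, 60, 2)
val = np.arange(1, 60, 2)


def mse(degree):
    """Fit a polynomial on the training points; return (train MSE, validation MSE)."""
    coeffs = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coeffs, x)
    train_mse = float(np.mean((pred[train] - y[train]) ** 2))
    val_mse = float(np.mean((pred[val] - y[val]) ** 2))
    return train_mse, val_mse


for degree in (0, 2):
    train_mse, val_mse = mse(degree)
    print(f"degree {degree}: train MSE {train_mse:.4f}, validation MSE {val_mse:.4f}")
```

The degree-0 fit predicts one constant for every point and is poor on both sets, which is the same signature as a model that gives identical predictions across many rows.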

Two features we tried to include in our model were seasons and dates. We wanted to include season because of its known effect on flu cases, for example. We also wanted to include dates to capture the continuity of time and give the model a sense of where each row's feature values sit on the timeline. To include seasons, we converted the date to a datetime type, extracted the month, and assigned each month to one of four categorical labels, one per season. We encoded dates ordinally, so that each value counts the days elapsed since the start of 2020 (for example, 1/1/2020 -> 1, 1/2/2020 -> 2, etc.). However, we found that seasons did not help improve our model. One reason could be that this information is already provided indirectly by the date encoding, so adding seasons as a feature would only introduce more complexity. Out of curiosity, since dropping season had improved our model, we also dropped the dates to see if it would perform better. To our surprise, when we dropped dates, our Linear Regression model improved to 131K MSE. One possible explanation is that the date feature, like season, added complexity without carrying much information about hospitalizations beyond what the other features already provide.
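The two encodings can be sketched as follows. The day count reproduces what the notebook cells compute (`dayofyear`, plus 366 for dates in 2021, since 2020 is a leap year); the month-to-season mapping is an assumption for illustration, as the exact assignment of months to seasons is not shown in the code above.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["1/1/2020", "3/15/2020", "1/1/2021"],
                                 format="%m/%d/%Y"))

# Days elapsed, counted so that 1/1/2020 maps to 1.
day_number = (dates - pd.Timestamp("2019-12-31")).dt.days

# The same quantity written the way the notebook cells compute it.
notebook_style = dates.dt.dayofyear + 366 * (dates.dt.year == 2021)

# Hypothetical month-to-season mapping (meteorological seasons).
month_to_season = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}
season = dates.dt.month.map(month_to_season)

print(pd.DataFrame({"date": dates, "day_number": day_number, "season": season}))
```

Subtracting a fixed reference date extends to any number of years without a separate offset per year.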



<h2>Part 4: Kaggle Submission</h2><p>
You need to generate a prediction CSV using the following cell from your trained model and submit the direct output of your code to Kaggle. The CSV shall contain TWO columns. The first column should be named "country_id" and be a concatenation of the country and date in the first two columns of the test_no_label.csv. This is because for the Kaggle competition we need a unique identifier for every row. For instance, the first entry should be "Belgium 9/7/2020". The second column of the prediction csv should have the same name as the target metric (either "next_week_increase_decrease" or "next_week_hospitalizations") with your generated predictions. Your file should have 1144 total rows excluding the column names. The order should be the same as in the test_baseline/creative_no_label.csv. A sample prediction file can be downloaded from Kaggle for each problem.

In [None]:
# TODO
dfsub.to_csv(path_or_buf = '/Users/nanboo/Desktop/CS 4780/finalproj/submissionnoyounginfs.csv', index = False)
dfsub_creative.to_csv(path_or_buf = '/Users/nanboo/Desktop/CS 4780/finalproj/submissioncreative.csv', index = False)
# You may use pandas to generate a dataframe with country, date and your predictions first
# and then use to_csv to generate a CSV file.
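For reference, a minimal sketch of building a file in the required two-column layout is shown below. The rows and prediction values are made up, and the CSV is written to an in-memory buffer instead of a path so the example runs anywhere.

```python
import io

import pandas as pd

# Made-up test rows and predictions, only to show the expected layout.
test_rows = pd.DataFrame({
    "country": ["Belgium", "Belgium", "Spain"],
    "date": ["9/7/2020", "9/8/2020", "3/10/2021"],
})
predictions = [100.0, 110.0, 900.0]

submission = pd.DataFrame({
    # Unique identifier: country and date joined by a single space.
    "country_id": test_rows["country"] + " " + test_rows["date"],
    "next_week_hospitalizations": predictions,
})

# index=False keeps the row index out of the file, leaving exactly two columns.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file path to `to_csv` in place of the buffer writes the same content to disk.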

<h2>Part 5: Resources and Literature Used</h2><p>

Links:

https://towardsdatascience.com/radial-basis-function-rbf-kernel-the-go-to-kernel-acf0d22c798a

https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html

https://www.jeremyjordan.me/deep-neural-networks-preventing-overfitting/

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

https://chrisalbon.com/machine_learning/linear_regression/effect_of_alpha_on_lasso_regression/

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
