## Assignment

The task is to build and train a classifier given a labeled dataset and then use it to infer the labels of a given unlabeled evaluation dataset. 

You will find the training and evaluation data on Canvas.

Here's the training data: TrainOnMe-2.csv 

Here's the evaluation data: EvaluateOnMe-2.csv 

Here's the ground truth: EvaluationGT-2.csv

You can use whatever Python libraries you like! The steps below are suggestions, but feel free to try any other techniques we discussed in class.

Submit your predicted labels by uploading them in CSV format; they will then be compared to the ground truth.


In [365]:
# Import packages 
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

# For feature selection
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# For min-max scaling
from sklearn.preprocessing import MinMaxScaler

# For encoding
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# Some models you can try
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier


## Load the training and evaluation datasets

In [366]:
# Read datasets
df = pd.read_csv('TrainOnMe-2.csv').iloc[:, 1:]
eval_df = pd.read_csv('EvaluateOnMe-2.csv').iloc[:, 1:]

# Split your training dataset into features and labels
#X = df.iloc[:, 1:]
#y = df['y']

#df_null_idx = df.isnull().any(axis=1)
#print(len(df_null_idx))
#X.loc[134]
df.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1004 entries, 0 to 1003
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   y       997 non-null    object 
 1   x1      1000 non-null   float64
 2   x2      1000 non-null   float64
 3   x3      1000 non-null   float64
 4   x4      1000 non-null   float64
 5   x5      1000 non-null   float64
 6   x6      1000 non-null   object 
 7   x7      1000 non-null   object 
 8   x8      1000 non-null   float64
 9   x9      1000 non-null   float64
 10  x10     1000 non-null   float64
 11  x11     1000 non-null   float64
 12  x12     1000 non-null   object 
 13  x13     1000 non-null   float64
dtypes: float64(10), object(4)
memory usage: 109.9+ KB
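
The `object` dtype reported for `x6` suggests that some of its entries are not valid numbers. The following is a minimal sketch, on a small invented frame rather than the assignment data, of how such entries could be located before deciding how to clean them. The frame `toy` and its values are assumptions made for illustration.

```python
import pandas as pd

# Toy frame standing in for the training data (values are invented).
toy = pd.DataFrame({
    "x1": [0.5, 1.2, None, 0.7],
    "x6": ["0.1", "oops", "0.3", "0.4"],
})

# Count missing values per column.
missing_per_column = toy.isnull().sum()

# Coerce x6 to numbers; entries that fail to parse become NaN.
x6_numeric = pd.to_numeric(toy["x6"], errors="coerce")

# Rows where x6 held a value that could not be parsed as a number.
bad_x6_rows = toy[x6_numeric.isna() & toy["x6"].notna()]

print(missing_per_column)
print(bad_x6_rows)
```

Looking at the offending rows first makes it easier to judge whether dropping them, as opposed to repairing them, loses much data.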


## Data pre-processing

In [367]:
# Do some data pre-processing
# Remove NA values and noise
# Check the dtypes of all features
# Convert text columns to category
# Change categories to encoded labels using LabelEncoder()



num_cols = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x8', 'x9', 'x10', 'x11', 'x13']
cat_cols = ['y', 'x7', 'x12']

#print(num_cols)

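# Coerce numeric columns; entries that cannot be parsed become NaN
for col in num_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Drop every row that contains a missing or unparseable value
df.dropna(inplace=True)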

# Encode the class labels as integers; y_encoder.classes_ maps them back to names
y_encoder = LabelEncoder()
df['y'] = y_encoder.fit_transform(df['y'])
y_cats = y_encoder.classes_

x7_encoder = LabelEncoder()
df['x7'] = x7_encoder.fit_transform(df['x7'])
x7_cats = x7_encoder.classes_

x12_encoder = LabelEncoder()
df['x12'] = x12_encoder.fit_transform(df['x12'])
x12_cats = x12_encoder.classes_

df.reset_index(drop=True, inplace=True)

# Copy so that later in-place edits to X and y do not act on views of df
X = df.iloc[:, 1:].copy()
y = df['y'].copy()

X_null_idx = df[df.isnull().any(axis=1)].index
print("number of null entries: ", len(X_null_idx))
print("y category list: ", y_cats)
print("x7 category list: ", x7_cats)
print("x12 category list: ", x12_cats)
df.info()

number of null entries:  0
y category list:  ['Atsuto' 'Bob' 'Jorg' 'Shoogee']
x7 category list:  ['Erik Sven Fernström' 'Erik Sven Williams' 'Jerry Fernström'
 'Jerry Williams' 'Jerry från Solna']
x12 category list:  [False True]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 995 entries, 0 to 994
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   y       995 non-null    int64  
 1   x1      995 non-null    float64
 2   x2      995 non-null    float64
 3   x3      995 non-null    float64
 4   x4      995 non-null    float64
 5   x5      995 non-null    float64
 6   x6      995 non-null    float64
 7   x7      995 non-null    int64  
 8   x8      995 non-null    float64
 9   x9      995 non-null    float64
 10  x10     995 non-null    float64
 11  x11     995 non-null    float64
 12  x12     995 non-null    int64  
 13  x13     995 non-null    float64
dtypes: float64(11), int64(3)
memory usage: 109.0 KB


## Dealing with outliers

In [368]:
# Try to remove outliers from training data to improve performance
# There are different ways to do this but one way could be to use stats.zscore

# Absolute z-score of every numeric feature, computed column by column
z_scores = np.abs(stats.zscore(df[num_cols], axis=0))

# Row positions with at least one feature more than 3 standard deviations from its mean
outliers = np.where(z_scores > 3)[0]
outlier_indices = np.unique(outliers)

# x4 contains one extreme value; find the row it belongs to
print(np.min(df['x4']))
extreme_x4_idx = np.where(df['x4'] < -900000)[0]
print(extreme_x4_idx)

print(outlier_indices)

# Inspect the row holding the extreme x4 value
print(X.iloc[842, :])


-90000000.35933
[842]
[ 55  64 145 219 249 267 279 343 399 450 453 489 525 538 567 604 609 775
 804 842 874 903 928]
x1     1.896880e+00
x2     6.781300e-01
x3    -1.389890e+00
x4    -9.000000e+07
x5     1.035667e+01
x6     8.733600e-01
x7     2.000000e+00
x8     1.240070e+00
x9    -2.378380e+00
x10   -7.605720e+00
x11   -2.107290e+00
x12    0.000000e+00
x13    9.414893e+01
Name: 842, dtype: float64
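
The z-score filter can also be written as a boolean row mask. The sketch below uses synthetic data with one planted extreme value, similar in spirit to the `x4` entry in row 842; the frame `toy` and the column names `a` and `b` are invented for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
toy = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
toy.loc[17, "a"] = -9e7  # one planted extreme value

# Absolute z-scores per column, as a plain NumPy array.
z = np.abs(stats.zscore(toy.to_numpy(), axis=0))

# Keep only rows where every feature lies within 3 standard deviations.
keep = (z < 3).all(axis=1)
cleaned = toy[keep].reset_index(drop=True)

print(len(toy), "rows before,", len(cleaned), "rows after")
```

A single extreme value inflates the standard deviation of its column, so the other rows in that column receive very small z-scores. Recomputing the z-scores after a first pass may therefore flag additional rows.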


In [369]:
#std_coeff = 1.96
#zscore_mean = np.mean(combined_z_scores)
#zscore_std = np.std(combined_z_scores)
#thresh = zscore_mean + std_coeff*zscore_std
#outlier_inds = np.where(np.abs(combined_z_scores) > 4)[0]
#
#print(combined_z_scores[outlier_inds])

In [370]:
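# outlier_indices are row positions; they match the index labels because
# df.reset_index() was called before X and y were created
X.drop(index=outlier_indices, inplace=True)
y.drop(index=outlier_indices, inplace=True)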


print(np.min(X, axis=0))
X.info()

x1     -1.73821
x2     -2.05038
x3    -10.60992
x4    -10.16991
x5      9.78143
x6     -2.02696
x7      0.00000
x8     -3.41003
x9     -3.68549
x10   -13.26419
x11    -6.00052
x12     0.00000
x13   -87.16365
dtype: float64
<class 'pandas.core.frame.DataFrame'>
Index: 972 entries, 0 to 994
Data columns (total 13 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x1      972 non-null    float64
 1   x2      972 non-null    float64
 2   x3      972 non-null    float64
 3   x4      972 non-null    float64
 4   x5      972 non-null    float64
 5   x6      972 non-null    float64
 6   x7      972 non-null    int64  
 7   x8      972 non-null    float64
 8   x9      972 non-null    float64
 9   x10     972 non-null    float64
 10  x11     972 non-null    float64
 11  x12     972 non-null    int64  
 12  x13     972 non-null    float64
dtypes: float64(11), int64(2)
memory usage: 106.3 KB


In [371]:
X.head()

|   | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 | x13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.20274 | -0.0469 | -4.69816 | -9.078 | 10.13118 | -0.089 | 1 | 0.54191 | 0.52041 | -5.6699 | -0.93831 | 0 | 107.78776 |
| 1 | 2.01516 | -0.12177 | -4.24286 | -9.79772 | 9.98259 | -0.01485 | 1 | -1.21671 | 1.18749 | -9.253 | -1.21892 | 0 | 98.63633 |
| 2 | 0.02598 | -0.24764 | 0.39977 | -9.54167 | 10.53391 | -0.27978 | 1 | -2.39764 | 1.95167 | -9.46447 | -2.6891 | 1 | 1.4988 |
| 3 | 0.39778 | -0.83343 | -2.14272 | -9.0655 | 10.15047 | -0.84583 | 4 | 0.09768 | 0.9201 | -11.17952 | 0.59877 | 0 | 18.81785 |
| 4 | 1.25346 | 0.0932 | 1.54063 | -9.33171 | 9.92016 | 0.09889 | 2 | -0.46134 | 0.16381 | -12.07755 | 1.09106 | 1 | 63.44326 |


## Scaling the features

In [372]:
# Scale your features
# You can try both StandardScaler and MinMaxScaler and see which works better
print(np.max(X['x4']))
print(np.min(X['x4']))
print(np.median(X['x4']))
print(pd.Series(X['x4']).describe())
print(len(X['x4'][X['x4']< -100]))
min_max_scaler = MinMaxScaler(feature_range=(0, 1))
X[num_cols] = min_max_scaler.fit_transform(X[num_cols])


#print(X['x4'].shape)
print(X.shape)
print(y.shape)
X.head()

-8.2922
-10.16991
-9.175795
count    972.000000
mean      -9.187813
std        0.383982
min      -10.169910
25%       -9.468750
50%       -9.175795
75%       -8.915192
max       -8.292200
Name: x4, dtype: float64
0
(972, 13)
(972,)


|   | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 | x13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.726424 | 0.510547 | 0.311269 | 0.581512 | 0.288689 | 0.487385 | 1 | 0.559048 | 0.539837 | 0.866377 | 0.395649 | 0 | 0.722285 |
| 1 | 0.691848 | 0.491468 | 0.335242 | 0.198215 | 0.166041 | 0.506033 | 1 | 0.310271 | 0.625458 | 0.457607 | 0.373717 | 0 | 0.68838 |
| 2 | 0.325188 | 0.459393 | 0.579688 | 0.334578 | 0.621109 | 0.439405 | 1 | 0.143214 | 0.723542 | 0.433482 | 0.258812 | 1 | 0.32849 |
| 3 | 0.393721 | 0.310116 | 0.44582 | 0.588169 | 0.304612 | 0.297047 | 4 | 0.496207 | 0.591138 | 0.237825 | 0.515783 | 0 | 0.392656 |
| 4 | 0.551446 | 0.546249 | 0.639758 | 0.446395 | 0.11451 | 0.534638 | 2 | 0.417127 | 0.494066 | 0.135375 | 0.554259 | 1 | 0.557991 |
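
The scaler above is fitted on all rows before the train/test split, so the held-out rows influence the scaling. A common alternative is to fit the scaler on the training portion only and reuse it for the rest. The following is a minimal sketch on synthetic data that also shows `StandardScaler` for comparison; the array `X_toy` is invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_tr, X_te = train_test_split(X_toy, train_size=0.9, random_state=0)

# Fit both scalers on the training rows only.
minmax = MinMaxScaler().fit(X_tr)
standard = StandardScaler().fit(X_tr)

# Apply the fitted scalers to both portions.
X_tr_minmax = minmax.transform(X_tr)
X_te_minmax = minmax.transform(X_te)  # may fall slightly outside [0, 1]
X_tr_std = standard.transform(X_tr)

print(X_tr_minmax.min(), X_tr_minmax.max())
print(X_tr_std.mean(axis=0), X_tr_std.std(axis=0))
```

Min-max scaling maps the training range to [0, 1], while standardization gives each training column zero mean and unit variance. Which one works better here is an empirical question.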


## Feature selection

In [373]:
# You could try to apply SelectKBest class to extract the most useful features (this is optional but MIGHT improve accuracy)
# Remove whichever features are not useful

# SelectKBest keeps the k highest-scoring features (k defaults to 10)
selector = SelectKBest(chi2).fit(X, y)

selected_feature_names = X.columns[selector.get_support()]
# Keep X's index so that the rows stay aligned with y
X_new = pd.DataFrame(selector.transform(X), columns=selected_feature_names, index=X.index)
print(type(X_new))
print("Selected features:", selected_feature_names)
X = X_new
X.head()

<class 'pandas.core.frame.DataFrame'>
Selected features: Index(['x1', 'x2', 'x4', 'x5', 'x7', 'x8', 'x10', 'x11', 'x12', 'x13'], dtype='object')


|   | x1 | x2 | x4 | x5 | x7 | x8 | x10 | x11 | x12 | x13 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.726424 | 0.510547 | 0.581512 | 0.288689 | 1.0 | 0.559048 | 0.866377 | 0.395649 | 0.0 | 0.722285 |
| 1 | 0.691848 | 0.491468 | 0.198215 | 0.166041 | 1.0 | 0.310271 | 0.457607 | 0.373717 | 0.0 | 0.68838 |
| 2 | 0.325188 | 0.459393 | 0.334578 | 0.621109 | 1.0 | 0.143214 | 0.433482 | 0.258812 | 1.0 | 0.32849 |
| 3 | 0.393721 | 0.310116 | 0.588169 | 0.304612 | 4.0 | 0.496207 | 0.237825 | 0.515783 | 0.0 | 0.392656 |
| 4 | 0.551446 | 0.546249 | 0.446395 | 0.11451 | 2.0 | 0.417127 | 0.135375 | 0.554259 | 1.0 | 0.557991 |
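
`SelectKBest(chi2)` keeps 10 features here because `k` defaults to 10. Inspecting the per-feature scores can help when choosing `k` explicitly. This is a minimal sketch on synthetic, non-negative data (which `chi2` requires); the column names and the name `toy_selector` are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=200)

# One feature that tracks the label and two that are pure noise, all in [0, 1].
X_toy = pd.DataFrame({
    "informative": np.clip(0.8 * y_toy + rng.uniform(0, 0.2, size=200), 0, 1),
    "noise_1": rng.uniform(0, 1, size=200),
    "noise_2": rng.uniform(0, 1, size=200),
})

toy_selector = SelectKBest(chi2, k=1).fit(X_toy, y_toy)

# Rank the features by their chi-squared score.
scores = pd.Series(toy_selector.scores_, index=X_toy.columns).sort_values(ascending=False)
kept = list(X_toy.columns[toy_selector.get_support()])

print(scores)
print("Kept:", kept)
```

A large gap in the sorted scores can be a reasonable place to cut, though the effect on accuracy still has to be checked with the model.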


## Split your data to train and test set

In [374]:
# Train, test, split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.9, random_state = 0)
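
With four classes and a small test split, class proportions can differ between the training and test portions. Passing `stratify` to `train_test_split` keeps them the same. This is a minimal sketch on invented labels, not a change to the split used above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 70 / 20 / 10 samples per class.
y_toy = np.array([0] * 70 + [1] * 20 + [2] * 10)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.9, random_state=0, stratify=y_toy
)

# Class counts in each portion follow the overall 70/20/10 proportions.
print(np.bincount(y_tr), np.bincount(y_te))
```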


## Fit the model

* You can try models other than the models listed below
* You can try different hyperparameters
* Evaluate your model using cross-validation
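
Trying different hyperparameters and evaluating with cross-validation can be combined in one step with `GridSearchCV`. The sketch below runs on synthetic data from `make_classification`; the grid values are arbitrary choices for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic two-class problem standing in for the real training data.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=2, random_state=0
)

# Candidate hyperparameters; every combination is scored with 5-fold cross-validation.
param_grid = {"C": [0.1, 0.5, 1.0, 10.0], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_toy, y_toy)

print(search.best_params_)
print("%0.2f mean cross-validated accuracy" % search.best_score_)
```

After fitting, `search.best_estimator_` is refitted on all the data passed to `fit` and can be used for prediction directly.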

In [375]:
# Try linear SVM classifier
linear = SVC(kernel='linear', C=0.5).fit(X_train, y_train)
print(linear.score(X_test,y_test))

# Evaluate using cross-validation
# (here and in the cells below, this runs on the 98-row test split, so each fold is small)
scores = cross_val_score(linear, X_test, y_test, cv=5)
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

0.5
0.45 accuracy with a standard deviation of 0.07


In [376]:
#Try decision tree classifier
decision_tree = DecisionTreeClassifier(criterion = "gini").fit(X_train, y_train)
print(decision_tree.score(X_test,y_test))

# Evaluate using cross-validation
scores = cross_val_score(decision_tree,X_test,y_test,cv=10)
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

0.40816326530612246
0.40 accuracy with a standard deviation of 0.13




In [377]:
#Try random forest classifier
random_forest = RandomForestClassifier().fit(X_train, y_train)
print(random_forest.score(X_test,y_test))

scores = cross_val_score(random_forest,X_test,y_test,cv=10)
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

0.6122448979591837
0.59 accuracy with a standard deviation of 0.10
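
The cells above call `cross_val_score` on `X_test` and `y_test`, which hold only 98 rows. The following is a minimal sketch, on synthetic data from `make_classification`, of cross-validating on the training portion instead, with the scaler placed inside a pipeline so that it is refitted within each fold. The dictionary `candidates` and the dataset are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic three-class problem standing in for the real data.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.9, random_state=0, stratify=y_toy
)

# Each candidate bundles scaling and the classifier into one estimator.
candidates = {
    "linear_svm": make_pipeline(MinMaxScaler(), SVC(kernel="linear", C=0.5)),
    "random_forest": make_pipeline(MinMaxScaler(), RandomForestClassifier(random_state=0)),
}

cv_results = {}
for name, model in candidates.items():
    # 5-fold cross-validation on the training portion only.
    scores = cross_val_score(model, X_tr, y_tr, cv=5)
    cv_results[name] = (scores.mean(), scores.std())
    print("%s: %0.2f accuracy with a standard deviation of %0.2f"
          % (name, scores.mean(), scores.std()))
```

The test portion then stays untouched until a final model has been chosen, which gives a less optimistic estimate of how it will do on the evaluation set.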


In [None]:
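# Use your best model to predict the labels for the evaluation set
# Assumes eval_df has the columns x1..x13 in the same order as the training features
X_eval = eval_df.copy()
X_eval['x7'] = x7_encoder.transform(X_eval['x7'])
X_eval['x12'] = x12_encoder.transform(X_eval['x12'])
X_eval[num_cols] = min_max_scaler.transform(X_eval[num_cols].apply(pd.to_numeric, errors='coerce'))
X_eval = pd.DataFrame(selector.transform(X_eval), columns=selected_feature_names)
best_model = random_forest  # highest test accuracy of the three models above
y_pred = y_encoder.inverse_transform(best_model.predict(X_eval))
print(y_pred)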


In [599]:
# Save your predictions to a csv and upload it to canvas

pd.DataFrame(y_pred).to_csv("file.txt", index=False, header=False)
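
Before uploading, it can help to read the saved file back and compare it with known labels. The sketch below writes invented predictions to a temporary file and scores them against an invented ground-truth list. The layout of `EvaluationGT-2.csv` is not shown in this notebook, so a single column of label names is assumed.

```python
import os
import tempfile

import pandas as pd

# Invented predictions and ground truth, using the label names from this dataset.
predicted = ["Bob", "Jorg", "Atsuto", "Bob"]
ground_truth = ["Bob", "Jorg", "Shoogee", "Bob"]

with tempfile.TemporaryDirectory() as tmp_dir:
    pred_path = os.path.join(tmp_dir, "predictions.csv")

    # Same call pattern as above: one label per line, no index, no header.
    pd.DataFrame(predicted).to_csv(pred_path, index=False, header=False)

    # Read the file back to confirm the round trip.
    loaded = pd.read_csv(pred_path, header=None)[0].tolist()

# Fraction of predictions that match the ground truth.
accuracy = sum(p == t for p, t in zip(loaded, ground_truth)) / len(ground_truth)
print(accuracy)
```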