## Processing a classification dataset

The dataset used here is the Kaggle [Titanic](https://www.kaggle.com/competitions/titanic) competition data.

The main goal is to improve the overall classification result, measured as accuracy.
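
As a quick reminder of what the metric measures, accuracy is the share of predictions that match the true labels. The snippet below is a minimal sketch on invented toy labels (not Titanic data) that compares a manual count with `accuracy_score`.

```python
from sklearn.metrics import accuracy_score

# Toy labels, invented for illustration only (not Titanic data).
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# Accuracy = number of correct predictions / number of predictions.
manual = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
library = accuracy_score(y_true, y_pred)

print(manual, library)
```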

The notebook goes through the following steps:
1. Checking the accuracy before applying any preprocessing algorithms
2. Analyzing the data
3. Using preprocessing algorithms:
   - Feature normalization and standardization
   - Feature selection
   - Feature extraction
4. Comparing the results

In [1]:
### libraries
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler

**1.Checking the accuracy**

In [2]:
### baseline accuracy before any preprocessing
pd.options.mode.copy_on_write = True
classifier = SVC()

# Training features: encode Sex as a number (male -> 1.0, otherwise 2.0), fill NaN with 0
train_data = pd.read_csv('train.csv')
X = train_data[['Pclass','Age','Fare','Sex']]
X['Sex'] = X['Sex'].apply(lambda x : 1.0 if x == 'male' else 2.0)
X = X.fillna(0)
X = X.to_numpy()
y = train_data['Survived'].to_numpy()

# NOTE: the evaluation set is built from train_data again, so the score printed
# below is training accuracy. Kaggle's test.csv has no Survived column, so it is
# loaded here but not used for scoring.
test_data = pd.read_csv('test.csv')
X_test = train_data[['Pclass','Age','Fare','Sex']]
X_test['Sex'] = X_test['Sex'].apply(lambda x : 1.0 if x == 'male' else 2.0)
X_test = X_test.fillna(0)
X_test = X_test.to_numpy()
y_test = train_data['Survived'].to_numpy()

classifier.fit(X,y)
predicts = classifier.predict(X_test)
report = classification_report(y_test, predicts)
accuracy_before = accuracy_score(y_test, predicts)

print("Accuracy before preprocessing: ", accuracy_before)

Accuracy before preprocessing:  0.6868686868686869
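
The cell above scores the classifier on the same rows it was fitted on, so the figure is training accuracy. A held-out estimate could be obtained with `train_test_split`. The sketch below uses synthetic data; `X_demo`, `y_demo`, the 25% split size, and the seed are illustrative assumptions, not values from this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for a feature matrix and binary target (assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Hold out 25% of the rows; stratify keeps the class balance similar in both parts.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

clf = SVC().fit(X_fit, y_fit)
train_acc = accuracy_score(y_fit, clf.predict(X_fit))  # seen rows
val_acc = accuracy_score(y_val, clf.predict(X_val))    # unseen rows

print("train:", train_acc, "held-out:", val_acc)
```

Only the held-out figure says anything about how the model behaves on passengers it has not seen.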


**2.Analyzing our data**

In [3]:
data = pd.read_csv('train.csv')
total_rows = len(data)
split_index = int(0.5 * total_rows)

data_df = pd.DataFrame(data.iloc[:split_index])
data_df_2 = pd.DataFrame(data.iloc[split_index:])

In [4]:
headers = data_df.columns.to_list()
# Share of distinct values per column, as a fraction (1.0 means every row is unique)
unique_ratio = []
for head in headers:
    unique_ratio.append(round(data_df[head].nunique() / len(data_df[head]), 3))

# Variable types assigned by hand, in column order
variable_types = ['discrete (unique)', 'discrete', 'discrete', 'discrete', 'categorical', 'continuous', 'discrete', 'discrete', 'discrete', 'continuous', 'categorical','categorical']
variable_df = pd.DataFrame({'Column': headers, 'Variable_Type': variable_types, 'the percentage values of unique values': unique_ratio})
variable_df

Unnamed: 0,Column,Variable_Type,the percentage values of unique values
0,PassengerId,discrete (unique),1.0
1,Survived,discrete,0.004
2,Pclass,discrete,0.007
3,Name,discrete,1.0
4,Sex,categorical,0.004
5,Age,continuous,0.166
6,SibSp,discrete,0.016
7,Parch,discrete,0.013
8,Ticket,discrete,0.847
9,Fare,continuous,0.398


- Drop columns with almost unique values (PassengerId, Name, Ticket)
- Fill NaN values with the mean (Age) or the most frequent value (Embarked, Cabin)
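
The two points above can be sketched on a tiny invented frame. The column values, the `threshold` of 0.9, and the names `demo` and `reduced` are assumptions for illustration only.

```python
import pandas as pd

# Tiny invented frame that mimics a few Titanic columns (not real rows).
demo = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Pclass": [3, 1, 3, 3],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, None, 26.0, 22.0],
})

# Fraction of distinct values per column; nunique() ignores NaN by default.
unique_ratio = demo.nunique() / len(demo)

# Hypothetical cut-off: columns above it behave like identifiers.
threshold = 0.9
near_unique = unique_ratio[unique_ratio > threshold].index.tolist()
reduced = demo.drop(columns=near_unique)

# Fill the missing Age with the column mean.
reduced["Age"] = reduced["Age"].fillna(reduced["Age"].mean())

print(near_unique)
print(reduced)
```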

In [5]:
## replace NaN values: mean for Age, most frequent value (mode) for Embarked and Cabin
print('NaN values for Embarked: ',data_df['Embarked'].isnull().sum())
print('NaN values for Cabin: ',data_df['Cabin'].isnull().sum())
print('NaN values for Age: ',data_df['Age'].isnull().sum())

embarked_mode = data_df['Embarked'].mode()[0]
cabin_mode = data_df['Cabin'].mode()[0]
age_mean = round(data_df['Age'].mean(),2)

# Fill the gaps column by column (same result as a row-by-row loop)
data_df['Age'] = data_df['Age'].fillna(age_mean)
data_df['Embarked'] = data_df['Embarked'].fillna(embarked_mode)
data_df['Cabin'] = data_df['Cabin'].fillna(cabin_mode)

print('NaN values for Embarked after: ',data_df['Embarked'].isnull().sum())
print('NaN values for Cabin after: ',data_df['Cabin'].isnull().sum())
print('NaN values for Age after: ',data_df['Age'].isnull().sum())

## changing categorical attributes to discrete
columns_to_change = ['Sex', 'Cabin', 'Embarked', 'Name', 'Ticket']

label_encoder = LabelEncoder()
for column in columns_to_change:
    data_df[column] = label_encoder.fit_transform(data_df[column])

data_df.head()

NaN values for Embarked:  1
NaN values for Cabin:  348
NaN values for Age:  88
NaN values for Embarked after:  0
NaN values for Cabin after:  0
NaN values for Age after:  0


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,59,1,22.0,1,0,275,7.25,31,2
1,2,1,1,98,0,38.0,1,0,325,71.2833,41,0
2,3,1,3,175,0,26.0,0,0,369,7.925,31,2
3,4,1,1,135,0,35.0,1,0,28,53.1,25,2
4,5,0,3,7,1,35.0,0,0,251,8.05,31,2
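
One thing to watch with the label-encoding step: `LabelEncoder` assigns integers in sorted order of the values it sees during `fit`, so an encoder fitted separately on another split can give the same category a different code. The sketch below shows this on invented port codes; the lists and variable names are assumptions.

```python
from sklearn.preprocessing import LabelEncoder

# Invented category values for two splits (assumption, not Titanic rows).
train_ports = ["C", "Q", "S", "S"]
test_ports = ["Q", "S"]

# Separate encoders: the code for "S" depends on which values each split contains.
code_in_train = LabelEncoder().fit(train_ports).transform(["S"])[0]
code_in_test = LabelEncoder().fit(test_ports).transform(["S"])[0]

# Shared encoder: fit once, then reuse it so codes stay consistent.
shared = LabelEncoder().fit(train_ports)
shared_test_codes = shared.transform(test_ports)

print(code_in_train, code_in_test, shared_test_codes)
```

A shared encoder raises a `ValueError` on categories it has not seen. For feature columns, scikit-learn also offers `OrdinalEncoder`, which may be the better fit.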


**3.1.Normalization and standardization**

In [6]:
## Normalization and standardization

# Features: every column except the target (Survived)
X = data_df.drop(columns='Survived')
cols = X.columns.tolist()

norm = MinMaxScaler(feature_range=(0,1)).fit(X)
normalized_data = pd.DataFrame(norm.transform(X), columns=cols)

scale = StandardScaler().fit(normalized_data)
normalized_data = pd.DataFrame(scale.transform(normalized_data), columns=cols)

normalized_data.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,-1.728163,0.80402,-1.268876,0.789992,-0.530352,0.34847,-0.491216,0.788949,-0.508962,-0.167394,0.588629
1,-1.720378,-1.64659,-0.96528,-1.265835,0.743711,0.34847,-0.491216,1.245897,0.788128,0.78032,-1.966881
2,-1.712594,0.80402,-0.365872,-1.265835,-0.211836,-0.498902,-0.491216,1.648012,-0.495289,-0.167394,0.588629
3,-1.704809,-1.64659,-0.677253,-1.265835,0.504824,0.34847,-0.491216,-1.468375,0.419798,-0.736022,0.588629
4,-1.697025,0.80402,-1.673671,0.789992,0.504824,-0.498902,-0.491216,0.569614,-0.492757,-0.167394,0.588629
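
A note on chaining the two scalers: standardization subtracts the mean and divides by the standard deviation, so rescaling a column to [0, 1] beforehand does not change the standardized values. The check below is a small sketch on synthetic numbers (`X_demo` is an assumption) that compares the chained result with `StandardScaler` alone.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic numeric columns (assumption, not Titanic data).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=30.0, scale=12.0, size=(100, 3))

# Min-max scaling followed by standardization, as in the cell above.
chained = StandardScaler().fit_transform(MinMaxScaler().fit_transform(X_demo))

# Standardization alone.
direct = StandardScaler().fit_transform(X_demo)

# True when both routes agree up to floating-point error.
same = np.allclose(chained, direct)
print(same)
```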


In [7]:
#### the same steps for the second half of the data, used as the test set
embarked_mode_2 = data_df_2['Embarked'].mode()[0]
cabin_mode_2 = data_df_2['Cabin'].mode()[0]
age_mean_2 = round(data_df_2['Age'].mean(), 2)

# Fill the gaps column by column (same result as a row-by-row loop)
data_df_2['Age'] = data_df_2['Age'].fillna(age_mean_2)
data_df_2['Embarked'] = data_df_2['Embarked'].fillna(embarked_mode_2)
data_df_2['Cabin'] = data_df_2['Cabin'].fillna(cabin_mode_2)

# Changing categorical attributes to discrete
columns_to_change = ['Sex', 'Cabin', 'Embarked', 'Name', 'Ticket']

label_encoder = LabelEncoder()
for column in columns_to_change:
    data_df_2[column] = label_encoder.fit_transform(data_df_2[column])

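# Normalization and standardization
# NOTE: these scalers are fitted on the second half itself (all columns, Survived included)
# rather than reusing the scalers fitted on the first half.
X_2 = data_df_2.copy()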
cols_all_2 = data_df_2.columns.tolist()

norm_2 = MinMaxScaler(feature_range=(0, 1)).fit(X_2)
normalized_data_2 = pd.DataFrame(norm_2.transform(X_2), columns=cols_all_2)

scale_2 = StandardScaler().fit(normalized_data_2)
normalized_data_2 = pd.DataFrame(scale_2.transform(normalized_data_2), columns=cols_all_2)

normalized_data_2.head()

#### Baseline on the 50/50 split, without scaling
classifier = SVC()
pd.options.mode.copy_on_write = True

X = data_df[['Pclass','Age','Fare','Sex']]
# NOTE: Sex was already label-encoded to integers above, so `x == 'male'` is never
# true and every row becomes 2.0 (in both halves). Age has also been imputed already,
# so this baseline is not the raw data.
X['Sex'] = X['Sex'].apply(lambda x : 1.0 if x == 'male' else 2.0)
X = X.fillna(0)
X = X.to_numpy()
y = data_df['Survived'].to_numpy()

X_test = data_df_2[['Pclass','Age','Fare','Sex']]
X_test['Sex'] = X_test['Sex'].apply(lambda x : 1.0 if x == 'male' else 2.0)
X_test = X_test.fillna(0)
X_test = X_test.to_numpy()
y_test = data_df_2['Survived'].to_numpy()

classifier.fit(X,y)
predicts = classifier.predict(X_test)
report = classification_report(y_test, predicts)
accuracy_before = accuracy_score(y_test, predicts)

print("Accuracy before preprocessing: ", accuracy_before)



#### Same split after normalization and standardization
classifier = SVC()
pd.options.mode.copy_on_write = True

X = normalized_data[['Pclass','Age','Fare','Sex']]
X = X.to_numpy()
y = data_df['Survived']
y = y.to_numpy()

X_test = normalized_data_2[['Pclass','Age','Fare','Sex']]
X_test = X_test.to_numpy()
y_test = data_df_2['Survived']
y_test = y_test.to_numpy()

classifier.fit(X,y)
predicts = classifier.predict(X_test)
report = classification_report(y_test, predicts)
accuracy_after = accuracy_score(y_test, predicts)

print("Accuracy after preprocessing: ", accuracy_after)

Accuracy before preprocessing:  0.679372197309417
Accuracy after preprocessing:  0.7892376681614349
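
In the cells above, the second half of the data is scaled with scalers fitted on that half. A common alternative is to learn the scaling statistics from the training rows only and apply them unchanged to the evaluation rows. The sketch below shows that pattern with `make_pipeline`. The synthetic data and all names are assumptions, and no claim is made about how the Titanic numbers would change.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic features on very different scales (assumption, not Titanic data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4)) * np.array([1.0, 15.0, 50.0, 0.5])
y_demo = (X_demo[:, 0] + 2 * X_demo[:, 3] > 0).astype(int)

# 50/50 split by position, mirroring the split used in this notebook.
X_fit, X_eval = X_demo[:150], X_demo[150:]
y_fit, y_eval = y_demo[:150], y_demo[150:]

# fit() learns mean and std from X_fit only; predict() reuses them on X_eval.
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_fit, y_fit)
acc = accuracy_score(y_eval, model.predict(X_eval))

# The scaler inside the pipeline holds training-half statistics.
scaler = model.named_steps["standardscaler"]
print(acc, scaler.mean_)
```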


**3.2.Feature selection**
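
One possible starting point for this step, sketched on synthetic data: scikit-learn's `SelectKBest` with the `f_classif` score keeps the `k` features most strongly associated with the label. The data, `k=2`, and all variable names are assumptions for illustration; nothing here is run on the Titanic columns.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: column 0 carries the signal, columns 1-3 are pure noise (assumption).
rng = np.random.default_rng(0)
n = 200
y_demo = rng.integers(0, 2, size=n)
informative = y_demo + rng.normal(scale=0.3, size=n)
noise = rng.normal(size=(n, 3))
X_demo = np.column_stack([informative, noise])

# Score each feature with the ANOVA F-test and keep the two best.
selector = SelectKBest(score_func=f_classif, k=2).fit(X_demo, y_demo)
kept = selector.get_support(indices=True)   # indices of the retained columns
X_selected = selector.transform(X_demo)

print(kept, X_selected.shape)
```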

**3.3.Feature extraction**
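
One possible starting point for this step, again on synthetic data: principal component analysis (`PCA`) builds new features as linear combinations of the scaled inputs. The data, the choice of two components, and all names are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features (assumption, not Titanic data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))

# PCA is sensitive to scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X_demo)

# Project onto the two directions with the largest variance.
pca = PCA(n_components=2).fit(X_scaled)
X_reduced = pca.transform(X_scaled)

print(X_reduced.shape, pca.explained_variance_ratio_)
```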