## Processing a classification dataset

The dataset used here is the Kaggle [Titanic](https://www.kaggle.com/competitions/titanic) competition data.

The main goal is to improve the overall classification result, measured as accuracy.

The steps are:
1. Checking the accuracy before applying any preprocessing algorithms
2. Analyzing the data
3. Using preprocessing algorithms:
- Feature normalization and standardization
- Feature selection  
- Feature extraction
4. Comparing the results

In [1]:
### libraries
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler

**1.Checking the accuracy**

In [2]:
### Baseline: fit and score an SVC before any dedicated preprocessing
classifier = SVC()
pd.options.mode.copy_on_write = True


train_data = pd.read_csv('train.csv')
X = train_data[['Pclass','Age','Fare','Sex']]
# Encode Sex numerically: male -> 1.0, anything else -> 2.0
X['Sex'] = X['Sex'].apply(lambda x : 1.0 if x == 'male' else 2.0)
X = X.fillna(0)
X = X.to_numpy()
y = train_data['Survived'].to_numpy()

# Kaggle's test.csv has no 'Survived' column, so it cannot be scored locally.
# The evaluation below therefore reuses the training rows, which means this
# is a training accuracy, not a held-out one.
test_data = pd.read_csv('test.csv')
X_test = train_data[['Pclass','Age','Fare','Sex']]
X_test['Sex'] = X_test['Sex'].apply(lambda x : 1.0 if x == 'male' else 2.0)
X_test = X_test.fillna(0)
X_test = X_test.to_numpy()
y_test = train_data['Survived'].to_numpy()

classifier.fit(X,y)
predicts = classifier.predict(X_test)
report = classification_report(y_test, predicts)
accuracy_before = accuracy_score(y_test, predicts)

print("Accuracy before preprocessing: ", accuracy_before)

Accuracy before preprocessing:  0.6868686868686869
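
The score above is computed on the same rows the classifier was fitted on. A minimal sketch of a held-out estimate follows. It assumes a small synthetic frame in place of `train.csv`; the toy values, the 25% validation share and `random_state` are illustrative choices, not taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for train.csv (hypothetical values, same column names).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "Age": rng.uniform(1, 80, size=n),
    "Fare": rng.uniform(5, 100, size=n),
    "Sex": rng.choice(["male", "female"], size=n),
})
# Make the label depend on Sex so that there is something to learn.
toy["Survived"] = (toy["Sex"] == "female").astype(int)

X = toy[["Pclass", "Age", "Fare", "Sex"]].copy()
X["Sex"] = X["Sex"].map({"male": 1.0, "female": 2.0})
y = toy["Survived"]

# Hold out 25% of the labelled rows; stratify keeps the class ratio similar.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = SVC().fit(X_train, y_train)
val_accuracy = accuracy_score(y_val, clf.predict(X_val))
print("Held-out accuracy:", val_accuracy)
```

Scoring on rows the model has never seen gives a number that is comparable across preprocessing variants, which a training-set score is not.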


**2.Analyzing our data**

In [3]:
data = pd.read_csv('train.csv')
total_rows = len(data)
split_index = int(0.2 * total_rows)

data_df = pd.DataFrame(data.iloc[:split_index])
data_df_2 = pd.DataFrame(data.iloc[split_index:])

In [4]:
headers = data_df.columns.to_list()
# Fraction of distinct values per column (1.0 means every value is unique)
unique_ratios = []
for head in headers:
    unique_ratios.append(round(data_df[head].nunique() / len(data_df[head]), 3))

variable_types = ['discrete (unique)', 'discrete', 'discrete', 'discrete', 'categorical', 'continuous', 'discrete', 'discrete', 'discrete', 'continuous', 'categorical','categorical']
variable_df = pd.DataFrame({'Column': headers, 'Variable_Type': variable_types, 'the percentage values of unique values': unique_ratios})
variable_df

| | Column | Variable_Type | the percentage values of unique values |
|---|---|---|---|
| 0 | PassengerId | discrete (unique) | 1.0 |
| 1 | Survived | discrete | 0.011 |
| 2 | Pclass | discrete | 0.017 |
| 3 | Name | discrete | 1.0 |
| 4 | Sex | categorical | 0.011 |
| 5 | Age | continuous | 0.343 |
| 6 | SibSp | discrete | 0.039 |
| 7 | Parch | discrete | 0.034 |
| 8 | Ticket | discrete | 0.91 |
| 9 | Fare | continuous | 0.584 |


- drop columns whose values are almost all unique (PassengerId, Name, Ticket)
- fill NaN values: the mean for Age, the most frequent value (mode) for Embarked and Cabin
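
The two points above can be sketched on a toy frame. The values below are made up for illustration; only the column names come from the dataset. The sketch assumes the mean for the numeric column and the most frequent value for the categorical one.

```python
import numpy as np
import pandas as pd

# Toy frame with a few of the train.csv column names (values are made up).
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Name": ["A", "B", "C", "D"],
    "Ticket": ["T1", "T2", "T3", "T4"],
    "Age": [20.0, np.nan, 40.0, np.nan],
    "Embarked": ["S", "S", None, "C"],
})

# Drop identifier-like columns whose values are (almost) all distinct.
cleaned = toy.drop(columns=["PassengerId", "Name", "Ticket"])

# Numeric column: fill with the mean; categorical column: fill with the mode.
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].mean())
cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode()[0])

print(cleaned)
```

`fillna` works on a whole column at once, so no explicit loop over rows is needed.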

In [5]:
## Fill NaN values: mean for Age, mode for Embarked and Cabin
print('NaN values for Embarked: ',data_df['Embarked'].isnull().sum())
print('NaN values for Cabin: ',data_df['Cabin'].isnull().sum())
print('NaN values for Age: ',data_df['Age'].isnull().sum())

embarked_mode = data_df['Embarked'].mode()[0]
cabin_mode = data_df['Cabin'].mode()[0]
age_mean = round(data_df['Age'].mean(),2)

# Fill the missing values column by column (vectorized, no row loop needed)
data_df['Age'] = data_df['Age'].fillna(age_mean)
data_df['Embarked'] = data_df['Embarked'].fillna(embarked_mode)
data_df['Cabin'] = data_df['Cabin'].fillna(cabin_mode)

print('NaN values for Embarked after: ',data_df['Embarked'].isnull().sum())
print('NaN values for Cabin after: ',data_df['Cabin'].isnull().sum())
print('NaN values for Age after: ',data_df['Age'].isnull().sum())

## Changing categorical attributes to discrete
columns_to_change = ['Sex', 'Cabin', 'Embarked', 'Name', 'Ticket']

label_encoder = LabelEncoder()
for column in columns_to_change:
    data_df[column] = label_encoder.fit_transform(data_df[column])

data_df.head()

NaN values for Embarked:  1
NaN values for Cabin:  143
NaN values for Age:  35
NaN values for Embarked after:  0
NaN values for Cabin after:  0
NaN values for Age after:  0


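| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 21 | 1 | 22.0 | 1 | 0 | 114 | 7.25 | 11 | 2 |
| 1 | 2 | 1 | 1 | 39 | 0 | 38.0 | 1 | 0 | 139 | 71.2833 | 17 | 0 |
| 2 | 3 | 1 | 3 | 67 | 0 | 26.0 | 0 | 0 | 159 | 7.925 | 11 | 2 |
| 3 | 4 | 1 | 1 | 53 | 0 | 35.0 | 1 | 0 | 10 | 53.1 | 11 | 2 |
| 4 | 5 | 0 | 3 | 1 | 1 | 35.0 | 0 | 0 | 102 | 8.05 | 11 | 2 |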
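
`LabelEncoder` assigns integer codes in sorted order of the values it sees during `fit`, so codes are only comparable across data splits when the same fitted encoder is reused. A minimal sketch of that idea, using `OrdinalEncoder` as a swapped-in alternative to the `LabelEncoder` calls above (the toy values and the `-1` code for unseen categories are assumptions for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy splits: "X" appears only in the second one.
train = pd.DataFrame({"Embarked": ["S", "C", "S", "Q"]})
test = pd.DataFrame({"Embarked": ["Q", "S", "X"]})

# Categories unseen during fit are mapped to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on the first split only, then reuse the same mapping on the second.
train_codes = encoder.fit_transform(train[["Embarked"]])
test_codes = encoder.transform(test[["Embarked"]])

print(encoder.categories_[0])   # categories learned from the first split
print(train_codes.ravel())
print(test_codes.ravel())
```

With a shared mapping, the same category always gets the same code in both splits.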


**3.1.Normalization and standardization**

In [6]:
## Normalization and standardization

# Feature matrix: every column except 'Survived' (column index 1)
X = pd.concat([data_df.iloc[:, :1], data_df.iloc[:, 2:]], axis=1)
cols_all = data_df.columns.tolist()
cols = cols_all[:1] + cols_all[2:]

# Min-max scaling maps every feature to the range [0, 1]
norm = MinMaxScaler(feature_range=(0,1)).fit(X)
normalized_data = pd.DataFrame(norm.transform(X), columns=cols)

# Standardization then rescales every feature to zero mean and unit variance
scale = StandardScaler().fit(normalized_data)
normalized_data = pd.DataFrame(scale.transform(normalized_data), columns=cols)

normalized_data.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.722347 | 0.709984 | -1.313655 | 0.731083 | -0.448164 | 0.277957 | -0.488311 | 0.736245 | -0.547959 | -0.200607 | 0.58706 |
| 1 | -1.702886 | -1.792533 | -0.963347 | -1.367833 | 0.734833 | 0.277957 | -0.488311 | 1.268542 | 1.169363 | 1.190617 | -1.930933 |
| 2 | -1.683424 | 0.709984 | -0.418423 | -1.367833 | -0.152415 | -0.546648 | -0.488311 | 1.69438 | -0.529857 | -0.200607 | 0.58706 |
| 3 | -1.663963 | -1.792533 | -0.690885 | -1.367833 | 0.513021 | 0.277957 | -0.488311 | -1.478112 | 0.681702 | -0.200607 | 0.58706 |
| 4 | -1.644501 | 0.709984 | -1.702886 | 0.731083 | 0.513021 | -0.546648 | -0.488311 | 0.480742 | -0.526504 | -0.200607 | 0.58706 |
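
Min-max scaling and standardization are both per-feature affine maps, so standardizing min-max-scaled data gives the same values as standardizing the raw data directly. A small sketch on a made-up matrix (the numbers are illustrative only) shows this:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up feature matrix: two columns on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [4.0, 800.0],
              [7.0, 100.0]])

# Min-max scaling: each column is mapped to [0, 1].
min_max = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Standardization: each column gets zero mean and unit variance.
standard_only = StandardScaler().fit_transform(X)

# Standardization applied on top of min-max scaling.
both = StandardScaler().fit_transform(min_max)

print(min_max.min(axis=0), min_max.max(axis=0))
print(np.allclose(both, standard_only))
```

In other words, the chained result is determined by the standardization step alone, as long as no feature is constant.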


In [7]:
#### The same steps for the second split, which is used as the test set
# Note: the fill values, label encoders and scalers are all re-fitted on this split
embarked_mode_2 = data_df_2['Embarked'].mode()[0]
cabin_mode_2 = data_df_2['Cabin'].mode()[0]
age_mean_2 = round(data_df_2['Age'].mean(), 2)

# Fill the missing values column by column
data_df_2['Age'] = data_df_2['Age'].fillna(age_mean_2)
data_df_2['Embarked'] = data_df_2['Embarked'].fillna(embarked_mode_2)
data_df_2['Cabin'] = data_df_2['Cabin'].fillna(cabin_mode_2)

# Changing categorical attributes to discrete
columns_to_change = ['Sex', 'Cabin', 'Embarked', 'Name', 'Ticket']

label_encoder = LabelEncoder()
for column in columns_to_change:
    data_df_2[column] = label_encoder.fit_transform(data_df_2[column])

# Normalization and standardization
# (applied to all columns, including 'Survived'; only feature columns are used later)
X_2 = data_df_2.copy()
cols_all_2 = data_df_2.columns.tolist()

norm_2 = MinMaxScaler(feature_range=(0, 1)).fit(X_2)
normalized_data_2 = pd.DataFrame(norm_2.transform(X_2), columns=cols_all_2)

scale_2 = StandardScaler().fit(normalized_data_2)
normalized_data_2 = pd.DataFrame(scale_2.transform(normalized_data_2), columns=cols_all_2)

normalized_data_2.head()

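#### Before: SVC on the filled and label-encoded, but unscaled, features
classifier = SVC()
pd.options.mode.copy_on_write = True

X = data_df[['Pclass','Age','Fare','Sex']]
# 'Sex' is already label-encoded at this point, so x == 'male' is never true
# and every row is mapped to 2.0
X['Sex'] = X['Sex'].apply(lambda x : 1.0 if x == 'male' else 2.0)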
X = X.fillna(0)
X = X.to_numpy()
y = data_df['Survived']

y = y.to_numpy()

X_test = data_df_2[['Pclass','Age','Fare','Sex']]
X_test.Sex = X_test.Sex.apply(lambda x : 1.0 if x == 'male' else 2.0)
X_test = X_test.fillna(0)
X_test = X_test.to_numpy()
y_test = data_df_2['Survived']
y_test = y_test.to_numpy()

classifier.fit(X,y)
predicts = classifier.predict(X_test)
report = classification_report(y_test, predicts)
accuracy_before = accuracy_score(y_test, predicts)

print("Accuracy before preprocessing: ", accuracy_before)



#### After: SVC on the min-max scaled and standardized features
classifier = SVC()
pd.options.mode.copy_on_write = True

X = normalized_data[['Pclass','Age','Fare','Sex']]
X = X.to_numpy()
y = data_df['Survived']
y = y.to_numpy()

X_test = normalized_data_2[['Pclass','Age','Fare','Sex']]
X_test = X_test.to_numpy()
y_test = data_df_2['Survived']
y_test = y_test.to_numpy()

classifier.fit(X,y)
predicts = classifier.predict(X_test)
report = classification_report(y_test, predicts)
accuracy_after = accuracy_score(y_test, predicts)

print("Accuracy after preprocessing: ", accuracy_after)

Accuracy before preprocessing:  0.6016830294530154
Accuracy after preprocessing:  0.7784011220196353
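
The comparison above fits a separate scaler on each split. A common alternative is to learn the scaling statistics on the training rows only and reuse them on the test rows. The sketch below does this with `make_pipeline` on synthetic data; the data, the 200/100 split and the feature scales are assumptions for illustration, not the notebook's setup.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic data: the second feature has a much larger scale than the first.
X = np.column_stack([rng.normal(0, 1, 300), rng.normal(0, 1000, 300)])
y = (X[:, 0] > 0).astype(int)  # only the small-scale feature is informative

X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

# SVC on raw features: the large-scale feature dominates the RBF kernel.
raw_acc = SVC().fit(X_train, y_train).score(X_test, y_test)

# Pipeline: the scaler is fitted on the training rows only and the same
# mean and variance are applied to the test rows inside score()/predict().
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)
scaled_acc = model.score(X_test, y_test)

print("Accuracy without scaling:", raw_acc)
print("Accuracy with scaling:   ", scaled_acc)
```

Keeping the scaler inside the pipeline means both splits pass through exactly the same transformation.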


**3.2.Feature selection**
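
One possible approach here, sketched under assumptions rather than taken from the notebook, is univariate selection with `SelectKBest` and the ANOVA F-score (`f_classif`). The synthetic data and the choice `k=1` are illustrative only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)

# One feature that follows the label closely, three that are pure noise.
informative = y + rng.normal(0, 0.3, size=n)
noise = rng.normal(0, 1, size=(n, 3))
X = np.column_stack([noise[:, 0], informative, noise[:, 1], noise[:, 2]])

# Keep the single feature with the highest F-score against the label.
selector = SelectKBest(score_func=f_classif, k=1).fit(X, y)
kept = selector.get_support(indices=True)
X_selected = selector.transform(X)

print("Kept column indices:", kept)
print("Shape after selection:", X_selected.shape)
```

On the Titanic features, the same call would rank columns such as `Pclass`, `Age`, `Fare` and `Sex` by how strongly each one separates the two classes on its own.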

**3.3.Feature extraction**
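
One possible approach here, again a sketch under assumptions and not the notebook's own method, is principal component analysis (`PCA`) on standardized features. The synthetic data and `n_components=2` are illustrative choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))

# Four observed features built from two latent ones plus a little noise.
X = np.column_stack([
    base[:, 0],
    2 * base[:, 0] + rng.normal(0, 0.01, 100),
    base[:, 1],
    -base[:, 1] + rng.normal(0, 0.01, 100),
])

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the two directions of largest variance.
pca = PCA(n_components=2).fit(X_scaled)
X_reduced = pca.transform(X_scaled)

print("Shape after extraction:", X_reduced.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_.sum())
```

Unlike feature selection, the extracted components are new columns formed as linear combinations of the original ones.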