# Assignment 3 - Text Analysis
An explanation of this assignment can be found in the accompanying .pdf document.


## Materials to review for this assignment
<h4>From Moodle:</h4> 
<h5><u>Review the notebooks regarding the following python topics</u>:</h5>
<div class="alert alert-info">
&#x2714; <b>Working with strings</b> (tutorial notebook)<br/>
&#x2714; <b>Text Analysis</b> (tutorial notebook)<br/>
&#x2714; <b>Hebrew text analysis tools (tokenizer, wordnet)</b> (moodle example)<br/>
&#x2714; <b>(brief review) All previous notebooks</b><br/>
</div> 
<h5><u>Review the presentations regarding the following topics</u>:</h5>
<div class="alert alert-info">
&#x2714; <b>Text Analysis</b> (lecture presentation)<br/>
&#x2714; <b>(brief review) All other presentations</b><br/>
</div>

## Personal Details:

## Preceding Step - import modules (packages)
This step is necessary in order to use external modules (packages). <br/>

In [24]:
# --------------------------------------
import pandas as pd
import numpy as np
# --------------------------------------


# --------------------------------------
# ------------- visualizations:
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
# --------------------------------------


# ---------------------------------------
import sklearn
from sklearn import preprocessing, metrics, pipeline, model_selection, feature_extraction 
from sklearn import naive_bayes, linear_model, svm, neural_network, neighbors, tree
from sklearn import decomposition, cluster

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV 
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import mean_squared_error, r2_score, silhouette_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder

from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import Perceptron, SGDClassifier
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
# ---------------------------------------


# ----------------- output and visualizations: 
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.simplefilter("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter("ignore", category=ConvergenceWarning)
# show several outputs in one cell. This lets us condense every trick into one cell.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# ---------------------------------------

### Text analysis and String manipulation imports:

In [25]:
# --------------------------------------
# --------- Text analysis and Hebrew text analysis imports:
# vectorizers:
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# regular expressions:
import re
# --------------------------------------

### (optional) Hebrew text analysis - WordNet (for Hebrew)
Note: using WordNet is optional.

#### (optional) Only if you didn't install Wordnet (for Hebrew) use:

In [26]:
# word net installation:

# uncomment if you want to use it and need to install it:
#!pip install wn
#!python -m wn download omw-he:1.4

In [27]:
def Word_Net(wn_dataframe):
    import wn
    # Run a vectorizer to collect the candidate words and n-grams to replace.
    vector = TfidfVectorizer(min_df=0.01, max_df=0.95, ngram_range=(1, 5), binary=True)
    df_columns = vector.fit_transform(wn_dataframe["story"])
    df_columns = pd.DataFrame(df_columns.toarray(), columns=vector.get_feature_names_out())
    print("shape is:", df_columns.shape)
    print("words replaced:")
    print(df_columns.columns)
    wordnet_he = wn.Wordnet('omw-he:1.4')
    for feature in df_columns.columns:  # go through the features and replace them
        w1 = wordnet_he.synsets(feature)
        if w1:  # the word exists in the WordNet database
            if w1[0].pos == "a":  # replace only when the first synset is an adjective
                feature_lema = w1[0].lemmas()
                print("replaced", feature, feature_lema[0])
                # Plain substring replacement: it also changes the feature inside longer words.
                wn_dataframe["story"] = wn_dataframe["story"].str.replace(feature, feature_lema[0], regex=False)

    return wn_dataframe
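
The `str.replace` call in `Word_Net` replaces plain substrings, so a short feature such as `בא` is also rewritten where it appears inside a longer word such as `אבא`. A minimal sketch of a whole-word alternative follows. The helper name `replace_whole_word` is hypothetical and not part of the notebook, and it assumes words are separated by whitespace (which holds after `clean_text`).

```python
import re

def replace_whole_word(text: str, word: str, replacement: str) -> str:
    """Replace `word` only where it is delimited by whitespace or the string edges."""
    # (?<!\S) = not preceded by a non-space character, (?!\S) = not followed by one.
    pattern = r"(?<!\S)" + re.escape(word) + r"(?!\S)"
    # A function is used as the replacement so backslashes in it are never interpreted.
    return re.sub(pattern, lambda match: replacement, text)

sentence = "אבא בא"
print(sentence.replace("בא", "X"))              # substring replacement touches both words
print(replace_whole_word(sentence, "בא", "X"))  # only the standalone word is replaced
```

Whether whole-word replacement actually improves the classifier is not something this sketch shows; it only makes the replacement step more predictable.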

### (optional) Hebrew text analysis - hebrew_tokenizer (Tokenizer for Hebrew)
Note: using hebrew_tokenizer is optional.

#### (optional) Only if you didn't install hebrew_tokenizer use:

In [28]:
# Hebrew tokenizer installation:

# uncomment if you want to use it and need to install it:
# !pip install hebrew_tokenizer

In [29]:
# Hebrew tokenizer import:

# uncomment if you want to use it:
# import hebrew_tokenizer as ht

### Reading input files
Read the input files: the annotated training corpus (raw text data) and the test corpus.

In [30]:
train_filename = 'annotated_corpus_for_train.csv'
test_filename  = 'corpus_for_test.csv'
df_train = pd.read_csv(train_filename, index_col=None, encoding='utf-8')
df_test  = pd.read_csv(test_filename, index_col=None, encoding='utf-8')

In [31]:
df_train.head(8)
df_train.shape

Unnamed: 0,story,gender
0,"כשחבר הזמין אותי לחול, לא באמת חשבתי שזה יקרה,...",m
1,לפני שהתגייסתי לצבא עשיתי כל מני מיונים ליחידו...,m
2,מאז שהתחילו הלימודים חלומו של כל סטודנט זה הפנ...,f
3,"כשהייתי ילד, מטוסים היה הדבר שהכי ריתק אותי. ב...",m
4,‏הייתי מדריכה בכפר נוער ומתאם הכפר היינו צריכי...,f
5,לפני כ3 חודשים טסתי לרומא למשך שבוע. טסתי במטו...,f
6,אני כבר שנתיים נשוי והשנה אני ואישתי סוף סוף י...,m
7,השנה התחלנו שיפוץ בדירה שלנו בתל אביב. הדירה ה...,f


(753, 2)

In [32]:
df_test.head(3)
df_test.shape

Unnamed: 0,test_example_id,story
0,0,כל קיץ אני והמשפחה נוסעים לארצות הברית לוס אנג...
1,1,"הגעתי לשירות המדינה אחרי שנתיים כפעיל בתנועת ""..."
2,2,אחת האהבות הגדולות שלי אלו הכלבים שלי ושל אישת...


(323, 2)

### Your implementation:
Write your code solution in the following code-cells

# personal imports*

In [33]:
from sklearn.metrics import make_scorer #allowed from comment from https://md.hit.ac.il/mod/forum/discuss.php?d=116482
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# cleaning

In [34]:
def clean_text(df):
    df_copy = df.copy(deep=True)
    df_copy["story"] = df_copy["story"].apply(lambda txt: ''.join([c if (c >= 'א' and c <= 'ת') or c == ' ' else '' for c in txt]))
    df_copy=Word_Net(df_copy)
    return df_copy
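
The character filter in `clean_text` keeps only the letters from `א` to `ת` plus the space. As a sketch of the same idea, the filter can be written as one regular expression over the code point range U+05D0 to U+05EA, which also covers the final letter forms. The helper name `keep_hebrew` and the sample string are made up for illustration.

```python
import re

# Everything that is NOT a Hebrew letter (alef..tav) and NOT a space.
_NON_HEBREW = re.compile(r"[^\u05D0-\u05EA ]")

def keep_hebrew(text: str) -> str:
    """Drop every character except Hebrew letters and spaces."""
    return _NON_HEBREW.sub("", text)

sample = "שלום, world! 123 מה נשמע?"
print(repr(keep_hebrew(sample)))
```

Note that characters are deleted, not replaced by a space, so two words joined only by punctuation (for example by a comma without a following space) end up merged into one token.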

# splitting the target data into male and female

In [35]:
def split_gender(df_train):
    X_train = df_train.drop(columns=["gender"])
    y_male = (df_train["gender"] == "m").astype(int)
    y_female = (df_train["gender"] == "f").astype(int)

    return X_train, y_male, y_female
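
Because `y_female` is the exact complement of `y_male`, a macro-averaged F1 score cannot tell the two targets apart when the predictions flip along with the labels: the per-class scores just swap places before being averaged. The sketch below illustrates this with a hand-written macro F1. The helper names and the toy labels are made up for illustration, and the scikit-learn scorer is not used here.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

y_male = [1, 0, 1, 1, 0, 0]
pred_male = [1, 0, 0, 1, 1, 0]

# Flip both the labels and the predictions, as happens for the female target.
y_female = [1 - v for v in y_male]
pred_female = [1 - v for v in pred_male]

print(macro_f1(y_male, pred_male), macro_f1(y_female, pred_female))
```

Under that assumption, scoring one of the two targets is enough; evaluating both doubles the run time without adding information.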

# calculating the target data

In [36]:
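def calculate(param, model_name, dataframe, train):
    # Helper that is not called elsewhere in this notebook. Assumptions: `model_name` is
    # an estimator class, `param` is a parameter grid for the pipeline, `dataframe` holds
    # the "story" column, and `train` holds the target labels.
    pipe = Pipeline([
        ('vector', TfidfVectorizer()),
        ('model', model_name()),
    ])
    grid_search = GridSearchCV(pipe, param_grid=param, cv=10,
                               scoring=make_scorer(f1_score, average='macro'))
    grid_search.fit(dataframe["story"], train)
    score = grid_search.best_score_
    return dataframe, score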
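
One reason to place the vectorizer inside the `Pipeline`, as the `calculate` helper sets out to do, is that cross-validation then refits the vocabulary on each training fold, so the held-out fold never influences the features. The sketch below shows this on a toy English corpus. The texts, labels, and grid values are made up for illustration and are not the assignment's data or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus: two classes with three documents each.
texts = ["red apple pie", "green apple tart", "sweet apple cake",
         "fast blue car", "slow blue truck", "loud blue bus"]
labels = [0, 0, 0, 1, 1, 1]

pipe = Pipeline([
    ("vector", TfidfVectorizer()),
    ("model", LogisticRegression(random_state=42)),
])

# Step-name prefixes route each parameter to the vectorizer or to the model.
param_grid = {
    "vector__ngram_range": [(1, 1), (1, 2)],
    "model__C": [0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring="f1_macro")
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

With this layout the vectorizer settings and the model settings are tuned in a single search, and `search.best_estimator_` can be applied directly to raw text.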

# vector parameters

In [37]:
def load_vector_params():
    # Candidate vectorizer settings. 'normalize' is consumed by vectroziering(),
    # not by the vectorizer itself.
    full_vector_param = [
        {'max_df': 0.9,  'min_df': 0.05, 'ngram_range': (1, 9), "binary": False, "normalize": True},
        {'max_df': 0.60, 'min_df': 0.03, 'ngram_range': (1, 7), "binary": False, "normalize": False},
        {'max_df': 0.99, 'min_df': 0.01, 'ngram_range': (1, 2), "binary": True,  "normalize": True},
        {'max_df': 0.85, 'min_df': 0.09, 'ngram_range': (1, 5), "binary": True,  "normalize": True},
        {'max_df': 0.85, 'min_df': 0.15, 'ngram_range': (1, 5), "binary": False, "normalize": True},
        {'max_df': 0.80, 'min_df': 0.39, 'ngram_range': (1, 5), "binary": True,  "normalize": True},
        {'max_df': 0.90, 'min_df': 0.03, 'ngram_range': (1, 5), "binary": True,  "normalize": True},
        {'max_df': 0.90, 'min_df': 0.03, 'ngram_range': (1, 5), "binary": True,  "normalize": True},  # same as the previous entry
        {'max_df': 0.90, 'min_df': 0.03, 'ngram_range': (1, 5), "binary": False, "normalize": True},
        {'max_df': 0.60, 'min_df': 0.03, 'ngram_range': (1, 7), "binary": False, "normalize": True},
    ]

    vec_name_list = [CountVectorizer(), TfidfVectorizer()]
    return full_vector_param, vec_name_list

# vectorizing

In [38]:
def vectroziering(vec_params, vec_name, dataframe):
    # Build a fresh vectorizer of the same class as `vec_name` (CountVectorizer or
    # TfidfVectorizer) with the given parameters.
    vector = type(vec_name)(min_df=vec_params['min_df'], max_df=vec_params['max_df'],
                            ngram_range=vec_params['ngram_range'], binary=vec_params['binary'])
    X_train = vector.fit_transform(dataframe["story"])
    # Without normalization the result stays a sparse matrix; estimators that need
    # dense input (e.g. GaussianNB) will fail on it.
    X_train_normalized = X_train
    if vec_params['normalize']:
        # L2-normalize every row (norm="l1" is also possible), then convert to a dense DataFrame.
        normalized = preprocessing.normalize(X_train, norm="l2")
        X_train_normalized = pd.DataFrame(normalized.toarray(), columns=vector.get_feature_names_out())
    return vector, X_train_normalized
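
The `normalize` flag matters mainly for `CountVectorizer`: `TfidfVectorizer` already scales every row to unit Euclidean length, because its `norm` parameter defaults to `"l2"`, so a second L2 normalization leaves its output unchanged. The sketch below checks this on a toy corpus. The documents and the helper `row_norms` are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import normalize

docs = ["the cat sat", "the dog sat down", "a bird flew away"]

def row_norms(matrix):
    """Euclidean length of every row of a sparse matrix."""
    dense = matrix.toarray()
    return np.sqrt((dense ** 2).sum(axis=1))

counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)

print(row_norms(counts))             # raw counts: rows are not unit length
print(row_norms(tfidf))              # unit length already (default norm="l2")
print(row_norms(normalize(counts)))  # unit length after explicit L2 normalization
```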

# model parameters

In [39]:
def load_model_params():
    # Parameter grids for each model we used, keyed by the model's name in model_dict.
    model_param = {
        "LogisticRegression": {
            "model__random_state": [42],
        },
        "KNN": {
            "model__n_neighbors": [3, 4],
            "model__algorithm": ["auto", "ball_tree", "kd_tree", "brute"],
            "model__n_jobs": [-1],
        },
        "DecisionTreeClassifier": {
            "model__max_depth": [14, 12, 10, 8],
            "model__random_state": [42],
        },
        "SGDClassifier": {
            "model__loss": ["hinge", "log_loss", "modified_huber", "squared_hinge",
                            "perceptron", "squared_error", "huber"],
            "model__n_jobs": [-1],
            "model__learning_rate": ["optimal", "invscaling", "constant", "adaptive"],
            "model__penalty": ["l1", "l2", "elasticnet", None],
            "model__alpha": [0.0001, 0.0002, 0.0003, 0.0005],
        },
        "SVC": {
            "model__kernel": ["linear", "poly"],
        },
    }
    model_dict = {
        "LogisticRegression": LogisticRegression(),
        "KNN": KNeighborsClassifier(),
        "DecisionTreeClassifier": DecisionTreeClassifier(),
        "SGDClassifier": SGDClassifier(),
        "SVC": SVC(),
    }

    return model_param, model_dict

# parameters for the machine learning

In [43]:
df_train = clean_text(df_train)
df_test = clean_text(df_test)
df_train["story"].to_csv('Xclean.csv', index=False,encoding='utf-8-sig')
df_test["story"].to_csv('Yclean.csv', index=False,encoding='utf-8-sig')
X_train, y_male, y_female = split_gender(df_train.copy())
vec_list_param,vec_name_list=load_vector_params()
model_param,model_dict=load_model_params()
best_score=0
score=0
best_parm=None
best_vec_parm=None
best_vectorizer=None
best_vector=None
for vec_params in vec_list_param:
    for vec_name in vec_name_list:
        vector,dataframe_vectorized=vectroziering(vec_params,vec_name,X_train)
        
        # GaussianNB baseline (10-fold macro F1). y_female is the complement of y_male,
        # so the two scores are expected to be identical.
        model=GaussianNB()
        male_score = cross_val_score(model, dataframe_vectorized,y_male , cv=10,scoring=make_scorer(f1_score, average='macro')) 
        print(f"male best score:",male_score.mean())
        female_score=cross_val_score(model, dataframe_vectorized,y_female, cv=10,scoring=make_scorer(f1_score, average='macro')) 
        print(f"female best score:",female_score.mean())
        # Grid search over every model in model_dict; the score is the mean of the female and male runs.
        
        for model_name,model in model_dict.items() :
            P_model=Pipeline([('model', model)])
            grid_search = GridSearchCV(estimator=P_model, param_grid=model_param[model_name], cv=10, n_jobs=-1,
                               scoring=make_scorer(f1_score, average='macro'))
            grid_search.fit(dataframe_vectorized,y_female)
            score=grid_search.best_score_+score
            grid_search.fit(dataframe_vectorized,y_male)
            score=grid_search.best_score_+score
            score=score/2
            print(f"the score for {model_name} is:{score}")
            if(best_score<score):
                best_score=score
                best_parm=grid_search.best_params_
                best_grid_search=grid_search
                best_estimator=best_model = grid_search.best_estimator_
                best_vectorizer=vec_params
                best_vec_parm=vec_name
                best_model_name=model_name
                best_dataframe_vectorized=dataframe_vectorized
                best_vector=vector
            score=0
            
            
print(f"total best score is:{best_score}")
print(f"total best parm for model is:{best_parm}")
print(f"total best parm for vector is{best_vec_parm} with vectorizer:{best_vectorizer}")

shape is: (753, 5417)
words replaced:
Index(['אבא', 'אבא שלי', 'אבי', 'אביב', 'אבל', 'אבל אז', 'אבל אחרי', 'אבל אין',
       'אבל אמרתי', 'אבל אני',
       ...
       'תקופת', 'תקופת המבחנים', 'תקופת הקורונה', 'תקין', 'תשובה', 'תשובות',
       'תשומת', 'תשומת לב', 'תשלום', 'תשע'],
      dtype='object', length=5417)
replaced אוהב אוֹהֵב
replaced אחדים אֲחָדִים
replaced אחר אַחֵר
replaced אחרון אַחֲרוֹן
replaced אסור אָסוּר
replaced אפל אָפֵל
replaced בא בָּא
replaced בהול בָּהוּל
replaced בטוח בָּטוּחַ
replaced גדול גָּדוֹל
replaced הגון הָגוּן
replaced הרוס הָרוּס
replaced זריז זָרִיז
replaced חיובי חִיּוּבִי
replaced חם חַם
replaced חמישי חֲמִישִׁי
replaced חסר חָסֵר
replaced טעים טָעִים
replaced יפה יָפֶה
replaced ישראלי יִשְׂרְאֵלִי
replaced כבד כָּבֵד
replaced כדאי כְּדָאִי
replaced כולל כּוֹלֵל
replaced מבין מֵבִין
replaced מגניב מַגְנִיב
replaced מדעי מַדָּעִי
replaced מהצד מֵהַצַּד
replaced מוזר מוּזָר
replaced מידי מִיָּדִי
replaced מינוס מִינוּס
replaced מכסף מֻכְסָף
replaced 

GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', LogisticRegression())]),
             n_jobs=-1, param_grid={'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for LogisticRegression is:0.4329778822905296


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', KNeighborsClassifier())]),
             n_jobs=-1,
             param_grid={'model__algorithm': ['auto', 'ball_tree', 'kd_tree',
                                              'brute'],
                         'model__n_jobs': [-1], 'model__n_neighbors': [3, 4]},
             scoring=make_scorer(f1_score, average=macro))

the score for KNN is:0.5661744839445167


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', DecisionTreeClassifier())]),
             n_jobs=-1,
             param_grid={'model__max_depth': [14, 12, 10, 8],
                         'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for DecisionTreeClassifier is:0.5743096355695612


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SGDClassifier())]),
             n_jobs=-1,
             param_grid={'model__alpha': [0.0001, 0.0002, 0.0003, 0.0005],
                         'model__learning_rate': ['optimal', 'invscaling',
                                                  'constant', 'adaptive'],
                         'model__loss': ['hinge', 'log_loss', 'modified_huber',
                                         'squared_hinge', 'perceptron',
                                         'squared_error', 'huber'],
                         'model__n_jobs': [-1],
                         'model__penalty': ['l1', 'l2', 'elasticnet', None]},
             scoring=make_scorer(f1_score, average=macro))

the score for SGDClassifier is:0.6692041623928916


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SVC())]), n_jobs=-1,
             param_grid={'model__kernel': ['linear', 'poly']},
             scoring=make_scorer(f1_score, average=macro))

the score for SVC is:0.48536664767804166
male best score: 0.6401585500976346
female best score: 0.6401585500976346


the score for LogisticRegression is:0.4329778822905296
the score for KNN is:0.5995339858345975
the score for DecisionTreeClassifier is:0.5791251505207932
the score for SGDClassifier is:0.681637163932794
the score for SVC is:0.49622625339420534
male best score: nan
female best score: nan


the score for LogisticRegression is:0.6299771389299051
the score for KNN is:0.5196597332943813
the score for DecisionTreeClassifier is:0.6163119379875871
the score for SGDClassifier is:0.6571230682680497
the score for SVC is:0.6331939075245481
male best score: nan
female best score: nan


the score for LogisticRegression is:0.4329778822905296
the score for KNN is:0.5894785963962595
the score for DecisionTreeClassifier is:0.6089430388625983
the score for SGDClassifier is:0.6847625489752851
the score for SVC is:0.5163966203145671
male best score: 0.49739146356185754
female best score: 0.49739146356185754


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', LogisticRegression())]),
             n_jobs=-1, param_grid={'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', LogisticRegression())]),
             n_jobs=-1, param_grid={'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for LogisticRegression is:0.4329778822905296


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', KNeighborsClassifier())]),
             n_jobs=-1,
             param_grid={'model__algorithm': ['auto', 'ball_tree', 'kd_tree',
                                              'brute'],
                         'model__n_jobs': [-1], 'model__n_neighbors': [3, 4]},
             scoring=make_scorer(f1_score, average=macro))

the score for KNN is:0.6001373399211001


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', DecisionTreeClassifier())]),
             n_jobs=-1,
             param_grid={'model__max_depth': [14, 12, 10, 8],
                         'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for DecisionTreeClassifier is:0.5820210125791292


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SGDClassifier())]),
             n_jobs=-1,
             param_grid={'model__alpha': [0.0001, 0.0002, 0.0003, 0.0005],
                         'model__learning_rate': ['optimal', 'invscaling',
                                                  'constant', 'adaptive'],
                         'model__loss': ['hinge', 'log_loss', 'modified_huber',
                                         'squared_hinge', 'perceptron',
                                         'squared_error', 'huber'],
                         'model__n_jobs': [-1],
                         'model__penalty': ['l1', 'l2', 'elasticnet', None]},
             scoring=make_scorer(f1_score, average=macro))

the score for SGDClassifier is:0.6964714654219841


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SVC())]), n_jobs=-1,
             param_grid={'model__kernel': ['linear', 'poly']},
             scoring=make_scorer(f1_score, average=macro))

the score for SVC is:0.47148936429895033
male best score: 0.49739146356185754
female best score: 0.49739146356185754


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', LogisticRegression())]),
             n_jobs=-1, param_grid={'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for LogisticRegression is:0.4329778822905296


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', KNeighborsClassifier())]),
             n_jobs=-1,
             param_grid={'model__algorithm': ['auto', 'ball_tree', 'kd_tree',
                                              'brute'],
                         'model__n_jobs': [-1], 'model__n_neighbors': [3, 4]},
             scoring=make_scorer(f1_score, average=macro))

the score for KNN is:0.6369471646751683


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', DecisionTreeClassifier())]),
             n_jobs=-1,
             param_grid={'model__max_depth': [14, 12, 10, 8],
                         'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for DecisionTreeClassifier is:0.6011657508404866


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SGDClassifier())]),
             n_jobs=-1,
             param_grid={'model__alpha': [0.0001, 0.0002, 0.0003, 0.0005],
                         'model__learning_rate': ['optimal', 'invscaling',
                                                  'constant', 'adaptive'],
                         'model__loss': ['hinge', 'log_loss', 'modified_huber',
                                         'squared_hinge', 'perceptron',
                                         'squared_error', 'huber'],
                         'model__n_jobs': [-1],
                         'model__penalty': ['l1', 'l2', 'elasticnet', None]},
             scoring=make_scorer(f1_score, average=macro))

the score for SGDClassifier is:0.70662548754468


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SVC())]), n_jobs=-1,
             param_grid={'model__kernel': ['linear', 'poly']},
             scoring=make_scorer(f1_score, average=macro))

the score for SVC is:0.49208548236930383
male best score: 0.6153182822309444
female best score: 0.6153182822309444


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', LogisticRegression())]),
             n_jobs=-1, param_grid={'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for LogisticRegression is:0.4329778822905296


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', KNeighborsClassifier())]),
             n_jobs=-1,
             param_grid={'model__algorithm': ['auto', 'ball_tree', 'kd_tree',
                                              'brute'],
                         'model__n_jobs': [-1], 'model__n_neighbors': [3, 4]},
             scoring=make_scorer(f1_score, average=macro))

the score for KNN is:0.5623110859634555


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', DecisionTreeClassifier())]),
             n_jobs=-1,
             param_grid={'model__max_depth': [14, 12, 10, 8],
                         'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for DecisionTreeClassifier is:0.5206910284711854


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SGDClassifier())]),
             n_jobs=-1,
             param_grid={'model__alpha': [0.0001, 0.0002, 0.0003, 0.0005],
                         'model__learning_rate': ['optimal', 'invscaling',
                                                  'constant', 'adaptive'],
                         'model__loss': ['hinge', 'log_loss', 'modified_huber',
                                         'squared_hinge', 'perceptron',
                                         'squared_error', 'huber'],
                         'model__n_jobs': [-1],
                         'model__penalty': ['l1', 'l2', 'elasticnet', None]},
             scoring=make_scorer(f1_score, average=macro))

the score for SGDClassifier is:0.6168228249779875


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SVC())]), n_jobs=-1,
             param_grid={'model__kernel': ['linear', 'poly']},
             scoring=make_scorer(f1_score, average=macro))

the score for SVC is:0.48196622269744005
male best score: 0.618676435032087
female best score: 0.618676435032087


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', LogisticRegression())]),
             n_jobs=-1, param_grid={'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for LogisticRegression is:0.4329778822905296


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', KNeighborsClassifier())]),
             n_jobs=-1,
             param_grid={'model__algorithm': ['auto', 'ball_tree', 'kd_tree',
                                              'brute'],
                         'model__n_jobs': [-1], 'model__n_neighbors': [3, 4]},
             scoring=make_scorer(f1_score, average=macro))

the score for KNN is:0.5258637705102904


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', DecisionTreeClassifier())]),
             n_jobs=-1,
             param_grid={'model__max_depth': [14, 12, 10, 8],
                         'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for DecisionTreeClassifier is:0.5206752131636903


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SGDClassifier())]),
             n_jobs=-1,
             param_grid={'model__alpha': [0.0001, 0.0002, 0.0003, 0.0005],
                         'model__learning_rate': ['optimal', 'invscaling',
                                                  'constant', 'adaptive'],
                         'model__loss': ['hinge', 'log_loss', 'modified_huber',
                                         'squared_hinge', 'perceptron',
                                         'squared_error', 'huber'],
                         'model__n_jobs': [-1],
                         'model__penalty': ['l1', 'l2', 'elasticnet', None]},
             scoring=make_scorer(f1_score, average=macro))

the score for SGDClassifier is:0.6135766933838949


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SVC())]), n_jobs=-1,
             param_grid={'model__kernel': ['linear', 'poly']},
             scoring=make_scorer(f1_score, average=macro))

the score for SVC is:0.4758672086767947
male best score: 0.5841426484575775
female best score: 0.5841426484575775


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', LogisticRegression())]),
             n_jobs=-1, param_grid={'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for LogisticRegression is:0.4437362581051529


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', KNeighborsClassifier())]),
             n_jobs=-1,
             param_grid={'model__algorithm': ['auto', 'ball_tree', 'kd_tree',
                                              'brute'],
                         'model__n_jobs': [-1], 'model__n_neighbors': [3, 4]},
             scoring=make_scorer(f1_score, average=macro))

the score for KNN is:0.5583494398813426


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', DecisionTreeClassifier())]),
             n_jobs=-1,
             param_grid={'model__max_depth': [14, 12, 10, 8],
                         'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for DecisionTreeClassifier is:0.5288502003931868


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SGDClassifier())]),
             n_jobs=-1,
             param_grid={'model__alpha': [0.0001, 0.0002, 0.0003, 0.0005],
                         'model__learning_rate': ['optimal', 'invscaling',
                                                  'constant', 'adaptive'],
                         'model__loss': ['hinge', 'log_loss', 'modified_huber',
                                         'squared_hinge', 'perceptron',
                                         'squared_error', 'huber'],
                         'model__n_jobs': [-1],
                         'model__penalty': ['l1', 'l2', 'elasticnet', None]},
             scoring=make_scorer(f1_score, average=macro))

the score for SGDClassifier is:0.5976748326166064


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SVC())]), n_jobs=-1,
             param_grid={'model__kernel': ['linear', 'poly']},
             scoring=make_scorer(f1_score, average=macro))

the score for SVC is:0.5203780319038225
male best score: 0.5923979869528138
female best score: 0.5923979869528138


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', LogisticRegression())]),
             n_jobs=-1, param_grid={'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for LogisticRegression is:0.4493248564667285


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', KNeighborsClassifier())]),
             n_jobs=-1,
             param_grid={'model__algorithm': ['auto', 'ball_tree', 'kd_tree',
                                              'brute'],
                         'model__n_jobs': [-1], 'model__n_neighbors': [3, 4]},
             scoring=make_scorer(f1_score, average=macro))

the score for KNN is:0.5502857055355383


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', DecisionTreeClassifier())]),
             n_jobs=-1,
             param_grid={'model__max_depth': [14, 12, 10, 8],
                         'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for DecisionTreeClassifier is:0.5479703511598357


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SGDClassifier())]),
             n_jobs=-1,
             param_grid={'model__alpha': [0.0001, 0.0002, 0.0003, 0.0005],
                         'model__learning_rate': ['optimal', 'invscaling',
                                                  'constant', 'adaptive'],
                         'model__loss': ['hinge', 'log_loss', 'modified_huber',
                                         'squared_hinge', 'perceptron',
                                         'squared_error', 'huber'],
                         'model__n_jobs': [-1],
                         'model__penalty': ['l1', 'l2', 'elasticnet', None]},
             scoring=make_scorer(f1_score, average=macro))

the score for SGDClassifier is:0.5963078460622988


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SVC())]), n_jobs=-1,
             param_grid={'model__kernel': ['linear', 'poly']},
             scoring=make_scorer(f1_score, average=macro))

the score for SVC is:0.5056520967912218
male best score: 0.47360997862685783
female best score: 0.47360997862685783


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', LogisticRegression())]),
             n_jobs=-1, param_grid={'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for LogisticRegression is:0.4329778822905296


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', KNeighborsClassifier())]),
             n_jobs=-1,
             param_grid={'model__algorithm': ['auto', 'ball_tree', 'kd_tree',
                                              'brute'],
                         'model__n_jobs': [-1], 'model__n_neighbors': [3, 4]},
             scoring=make_scorer(f1_score, average=macro))

the score for KNN is:0.5166100708657413


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', DecisionTreeClassifier())]),
             n_jobs=-1,
             param_grid={'model__max_depth': [14, 12, 10, 8],
                         'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for DecisionTreeClassifier is:0.5406249363371225


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SGDClassifier())]),
             n_jobs=-1,
             param_grid={'model__alpha': [0.0001, 0.0002, 0.0003, 0.0005],
                         'model__learning_rate': ['optimal', 'invscaling',
                                                  'constant', 'adaptive'],
                         'model__loss': ['hinge', 'log_loss', 'modified_huber',
                                         'squared_hinge', 'perceptron',
                                         'squared_error', 'huber'],
                         'model__n_jobs': [-1],
                         'model__penalty': ['l1', 'l2', 'elasticnet', None]},
             scoring=make_scorer(f1_score, average=macro))

the score for SGDClassifier is:0.5227573489502486


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SVC())]), n_jobs=-1,
             param_grid={'model__kernel': ['linear', 'poly']},
             scoring=make_scorer(f1_score, average=macro))

the score for SVC is:0.5104720659189892
male best score: 0.47195959775341106
female best score: 0.47195959775341106


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', LogisticRegression())]),
             n_jobs=-1, param_grid={'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for LogisticRegression is:0.4329778822905296


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', KNeighborsClassifier())]),
             n_jobs=-1,
             param_grid={'model__algorithm': ['auto', 'ball_tree', 'kd_tree',
                                              'brute'],
                         'model__n_jobs': [-1], 'model__n_neighbors': [3, 4]},
             scoring=make_scorer(f1_score, average=macro))

the score for KNN is:0.5337580697709017


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', DecisionTreeClassifier())]),
             n_jobs=-1,
             param_grid={'model__max_depth': [14, 12, 10, 8],
                         'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for DecisionTreeClassifier is:0.5574877275229728


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SGDClassifier())]),
             n_jobs=-1,
             param_grid={'model__alpha': [0.0001, 0.0002, 0.0003, 0.0005],
                         'model__learning_rate': ['optimal', 'invscaling',
                                                  'constant', 'adaptive'],
                         'model__loss': ['hinge', 'log_loss', 'modified_huber',
                                         'squared_hinge', 'perceptron',
                                         'squared_error', 'huber'],
                         'model__n_jobs': [-1],
                         'model__penalty': ['l1', 'l2', 'elasticnet', None]},
             scoring=make_scorer(f1_score, average=macro))

the score for SGDClassifier is:0.5286269880764924


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SVC())]), n_jobs=-1,
             param_grid={'model__kernel': ['linear', 'poly']},
             scoring=make_scorer(f1_score, average=macro))

the score for SVC is:0.5035195721699378
male best score: 0.6573433502093078
female best score: 0.6573433502093078


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', LogisticRegression())]),
             n_jobs=-1, param_grid={'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for LogisticRegression is:0.4329778822905296


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', KNeighborsClassifier())]),
             n_jobs=-1,
             param_grid={'model__algorithm': ['auto', 'ball_tree', 'kd_tree',
                                              'brute'],
                         'model__n_jobs': [-1], 'model__n_neighbors': [3, 4]},
             scoring=make_scorer(f1_score, average=macro))

the score for KNN is:0.5874109463917301


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', DecisionTreeClassifier())]),
             n_jobs=-1,
             param_grid={'model__max_depth': [14, 12, 10, 8],
                         'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for DecisionTreeClassifier is:0.5925679296678121


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SGDClassifier())]),
             n_jobs=-1,
             param_grid={'model__alpha': [0.0001, 0.0002, 0.0003, 0.0005],
                         'model__learning_rate': ['optimal', 'invscaling',
                                                  'constant', 'adaptive'],
                         'model__loss': ['hinge', 'log_loss', 'modified_huber',
                                         'squared_hinge', 'perceptron',
                                         'squared_error', 'huber'],
                         'model__n_jobs': [-1],
                         'model__penalty': ['l1', 'l2', 'elasticnet', None]},
             scoring=make_scorer(f1_score, average=macro))

the score for SGDClassifier is:0.6699385741981998


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SVC())]), n_jobs=-1,
             param_grid={'model__kernel': ['linear', 'poly']},
             scoring=make_scorer(f1_score, average=macro))

the score for SVC is:0.4694311624677473
male best score: 0.6514036197827868
female best score: 0.6514036197827868


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', LogisticRegression())]),
             n_jobs=-1, param_grid={'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for LogisticRegression is:0.4329778822905296


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', KNeighborsClassifier())]),
             n_jobs=-1,
             param_grid={'model__algorithm': ['auto', 'ball_tree', 'kd_tree',
                                              'brute'],
                         'model__n_jobs': [-1], 'model__n_neighbors': [3, 4]},
             scoring=make_scorer(f1_score, average=macro))

the score for KNN is:0.596916686412668


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', DecisionTreeClassifier())]),
             n_jobs=-1,
             param_grid={'model__max_depth': [14, 12, 10, 8],
                         'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for DecisionTreeClassifier is:0.586005449144031


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SGDClassifier())]),
             n_jobs=-1,
             param_grid={'model__alpha': [0.0001, 0.0002, 0.0003, 0.0005],
                         'model__learning_rate': ['optimal', 'invscaling',
                                                  'constant', 'adaptive'],
                         'model__loss': ['hinge', 'log_loss', 'modified_huber',
                                         'squared_hinge', 'perceptron',
                                         'squared_error', 'huber'],
                         'model__n_jobs': [-1],
                         'model__penalty': ['l1', 'l2', 'elasticnet', None]},
             scoring=make_scorer(f1_score, average=macro))

the score for SGDClassifier is:0.677878429388993


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SVC())]), n_jobs=-1,
             param_grid={'model__kernel': ['linear', 'poly']},
             scoring=make_scorer(f1_score, average=macro))

the score for SVC is:0.533047626796374
male best score: 0.6573433502093078
female best score: 0.6573433502093078


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', LogisticRegression())]),
             n_jobs=-1, param_grid={'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for LogisticRegression is:0.4329778822905296


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', KNeighborsClassifier())]),
             n_jobs=-1,
             param_grid={'model__algorithm': ['auto', 'ball_tree', 'kd_tree',
                                              'brute'],
                         'model__n_jobs': [-1], 'model__n_neighbors': [3, 4]},
             scoring=make_scorer(f1_score, average=macro))

the score for KNN is:0.5874109463917301


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', DecisionTreeClassifier())]),
             n_jobs=-1,
             param_grid={'model__max_depth': [14, 12, 10, 8],
                         'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for DecisionTreeClassifier is:0.5925679296678121


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SGDClassifier())]),
             n_jobs=-1,
             param_grid={'model__alpha': [0.0001, 0.0002, 0.0003, 0.0005],
                         'model__learning_rate': ['optimal', 'invscaling',
                                                  'constant', 'adaptive'],
                         'model__loss': ['hinge', 'log_loss', 'modified_huber',
                                         'squared_hinge', 'perceptron',
                                         'squared_error', 'huber'],
                         'model__n_jobs': [-1],
                         'model__penalty': ['l1', 'l2', 'elasticnet', None]},
             scoring=make_scorer(f1_score, average=macro))

the score for SGDClassifier is:0.6735520775704156


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SVC())]), n_jobs=-1,
             param_grid={'model__kernel': ['linear', 'poly']},
             scoring=make_scorer(f1_score, average=macro))

the score for SVC is:0.4694311624677473
male best score: 0.6514036197827868
female best score: 0.6514036197827868


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', LogisticRegression())]),
             n_jobs=-1, param_grid={'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for LogisticRegression is:0.4329778822905296


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', KNeighborsClassifier())]),
             n_jobs=-1,
             param_grid={'model__algorithm': ['auto', 'ball_tree', 'kd_tree',
                                              'brute'],
                         'model__n_jobs': [-1], 'model__n_neighbors': [3, 4]},
             scoring=make_scorer(f1_score, average=macro))

the score for KNN is:0.596916686412668


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', DecisionTreeClassifier())]),
             n_jobs=-1,
             param_grid={'model__max_depth': [14, 12, 10, 8],
                         'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for DecisionTreeClassifier is:0.586005449144031


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SGDClassifier())]),
             n_jobs=-1,
             param_grid={'model__alpha': [0.0001, 0.0002, 0.0003, 0.0005],
                         'model__learning_rate': ['optimal', 'invscaling',
                                                  'constant', 'adaptive'],
                         'model__loss': ['hinge', 'log_loss', 'modified_huber',
                                         'squared_hinge', 'perceptron',
                                         'squared_error', 'huber'],
                         'model__n_jobs': [-1],
                         'model__penalty': ['l1', 'l2', 'elasticnet', None]},
             scoring=make_scorer(f1_score, average=macro))

the score for SGDClassifier is:0.6838974763777571


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SVC())]), n_jobs=-1,
             param_grid={'model__kernel': ['linear', 'poly']},
             scoring=make_scorer(f1_score, average=macro))

the score for SVC is:0.533047626796374
male best score: 0.6760367652744302
female best score: 0.6760367652744302


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', LogisticRegression())]),
             n_jobs=-1, param_grid={'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for LogisticRegression is:0.4329778822905296


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', KNeighborsClassifier())]),
             n_jobs=-1,
             param_grid={'model__algorithm': ['auto', 'ball_tree', 'kd_tree',
                                              'brute'],
                         'model__n_jobs': [-1], 'model__n_neighbors': [3, 4]},
             scoring=make_scorer(f1_score, average=macro))

the score for KNN is:0.5585887464874112


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', DecisionTreeClassifier())]),
             n_jobs=-1,
             param_grid={'model__max_depth': [14, 12, 10, 8],
                         'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for DecisionTreeClassifier is:0.5648746789575947


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SGDClassifier())]),
             n_jobs=-1,
             param_grid={'model__alpha': [0.0001, 0.0002, 0.0003, 0.0005],
                         'model__learning_rate': ['optimal', 'invscaling',
                                                  'constant', 'adaptive'],
                         'model__loss': ['hinge', 'log_loss', 'modified_huber',
                                         'squared_hinge', 'perceptron',
                                         'squared_error', 'huber'],
                         'model__n_jobs': [-1],
                         'model__penalty': ['l1', 'l2', 'elasticnet', None]},
             scoring=make_scorer(f1_score, average=macro))

the score for SGDClassifier is:0.6807472136793739


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SVC())]), n_jobs=-1,
             param_grid={'model__kernel': ['linear', 'poly']},
             scoring=make_scorer(f1_score, average=macro))

the score for SVC is:0.47688036899087916
male best score: 0.6581711476214709
female best score: 0.6581711476214709


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', LogisticRegression())]),
             n_jobs=-1, param_grid={'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for LogisticRegression is:0.4329778822905296


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', KNeighborsClassifier())]),
             n_jobs=-1,
             param_grid={'model__algorithm': ['auto', 'ball_tree', 'kd_tree',
                                              'brute'],
                         'model__n_jobs': [-1], 'model__n_neighbors': [3, 4]},
             scoring=make_scorer(f1_score, average=macro))

the score for KNN is:0.5905351683660105


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', DecisionTreeClassifier())]),
             n_jobs=-1,
             param_grid={'model__max_depth': [14, 12, 10, 8],
                         'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for DecisionTreeClassifier is:0.5964900371697472


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SGDClassifier())]),
             n_jobs=-1,
             param_grid={'model__alpha': [0.0001, 0.0002, 0.0003, 0.0005],
                         'model__learning_rate': ['optimal', 'invscaling',
                                                  'constant', 'adaptive'],
                         'model__loss': ['hinge', 'log_loss', 'modified_huber',
                                         'squared_hinge', 'perceptron',
                                         'squared_error', 'huber'],
                         'model__n_jobs': [-1],
                         'model__penalty': ['l1', 'l2', 'elasticnet', None]},
             scoring=make_scorer(f1_score, average=macro))

the score for SGDClassifier is:0.6906335050311843


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SVC())]), n_jobs=-1,
             param_grid={'model__kernel': ['linear', 'poly']},
             scoring=make_scorer(f1_score, average=macro))

the score for SVC is:0.5145733034109798
male best score: 0.6586304137468453
female best score: 0.6586304137468453


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', LogisticRegression())]),
             n_jobs=-1, param_grid={'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for LogisticRegression is:0.43856648065210513


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', KNeighborsClassifier())]),
             n_jobs=-1,
             param_grid={'model__algorithm': ['auto', 'ball_tree', 'kd_tree',
                                              'brute'],
                         'model__n_jobs': [-1], 'model__n_neighbors': [3, 4]},
             scoring=make_scorer(f1_score, average=macro))

the score for KNN is:0.5467404406736653


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', DecisionTreeClassifier())]),
             n_jobs=-1,
             param_grid={'model__max_depth': [14, 12, 10, 8],
                         'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for DecisionTreeClassifier is:0.5606632340329181


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SGDClassifier())]),
             n_jobs=-1,
             param_grid={'model__alpha': [0.0001, 0.0002, 0.0003, 0.0005],
                         'model__learning_rate': ['optimal', 'invscaling',
                                                  'constant', 'adaptive'],
                         'model__loss': ['hinge', 'log_loss', 'modified_huber',
                                         'squared_hinge', 'perceptron',
                                         'squared_error', 'huber'],
                         'model__n_jobs': [-1],
                         'model__penalty': ['l1', 'l2', 'elasticnet', None]},
             scoring=make_scorer(f1_score, average=macro))

the score for SGDClassifier is:0.6745670722207533


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SVC())]), n_jobs=-1,
             param_grid={'model__kernel': ['linear', 'poly']},
             scoring=make_scorer(f1_score, average=macro))

the score for SVC is:0.46598140344145156
male best score: 0.6554103022222113
female best score: 0.6554103022222113


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', LogisticRegression())]),
             n_jobs=-1, param_grid={'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for LogisticRegression is:0.4329778822905296


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', KNeighborsClassifier())]),
             n_jobs=-1,
             param_grid={'model__algorithm': ['auto', 'ball_tree', 'kd_tree',
                                              'brute'],
                         'model__n_jobs': [-1], 'model__n_neighbors': [3, 4]},
             scoring=make_scorer(f1_score, average=macro))

the score for KNN is:0.5894785963962595


GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('model', DecisionTreeClassifier())]),
             n_jobs=-1,
             param_grid={'model__max_depth': [14, 12, 10, 8],
                         'model__random_state': [42]},
             scoring=make_scorer(f1_score, average=macro))

the score for DecisionTreeClassifier is:0.6089430388625983


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SGDClassifier())]),
             n_jobs=-1,
             param_grid={'model__alpha': [0.0001, 0.0002, 0.0003, 0.0005],
                         'model__learning_rate': ['optimal', 'invscaling',
                                                  'constant', 'adaptive'],
                         'model__loss': ['hinge', 'log_loss', 'modified_huber',
                                         'squared_hinge', 'perceptron',
                                         'squared_error', 'huber'],
                         'model__n_jobs': [-1],
                         'model__penalty': ['l1', 'l2', 'elasticnet', None]},
             scoring=make_scorer(f1_score, average=macro))

the score for SGDClassifier is:0.6935259834953293


GridSearchCV(cv=10, estimator=Pipeline(steps=[('model', SVC())]), n_jobs=-1,
             param_grid={'model__kernel': ['linear', 'poly']},
             scoring=make_scorer(f1_score, average=macro))

the score for SVC is:0.5163966203145671
total best score is:0.70662548754468
total best parm for model is:{'model__alpha': 0.0002, 'model__learning_rate': 'optimal', 'model__loss': 'perceptron', 'model__n_jobs': -1, 'model__penalty': 'elasticnet'}
total best parm for vector isTfidfVectorizer() with vectorizer:{'max_df': 0.99, 'min_df': 0.01, 'ngram_range': (1, 2), 'binary': True, 'normalize': True}


In [45]:
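# Vectorize the test set with the best vectorizer settings found above.
X_test = df_test.copy(deep=True)
X_train = best_dataframe_vectorized.copy(deep=True)
vector, X_Test_Vectorized = vectroziering(best_vectorizer, best_vec_parm, X_test)

# Keep only the features present in both train and test.
# A sorted list is used because recent pandas versions reject a set as an indexer.
columns = sorted(set(X_Test_Vectorized.columns).intersection(best_dataframe_vectorized.columns))
X_Test_Vectorized = X_Test_Vectorized[columns]
X_train = X_train[columns]

# Refit the best estimator on the shared features and predict the test labels.
model = best_estimator.fit(X_train, y_male)
y_pred = model.predict(X_Test_Vectorized)  # NumPy array of 0/1 labels
# Reuse the test index so the predictions line up with test_example_id in the concat.
y_pred = pd.Series(['m' if pred == 1 else 'f' for pred in y_pred], index=X_test.index)
df_predicted = pd.concat([X_test['test_example_id'], y_pred], axis=1)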





# Below are different approaches to the project that I tried


# Failed attempt at WordNet
The idea was to use only words that WordNet recognizes. It works, but the prediction score is low.

# Old data cleaning - works, but is longer and less efficient

# Logistic regression with Pipeline and GridSearchCV

# Neural network with Pipeline and GridSearchCV

param_grid = {
    'vector__min_df': [0.027, 0.03, 0.033],
    'vector__max_df': [0.9, 0.95],
    'vector__ngram_range': [(1, 5)],
    'vector__analyzer': ["char", "word"],
    'normalize__norm': ["l1", "l2"],
    "model__hidden_layer_sizes": [(16, 8), (32, 16)]
}
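
A hedged sketch of how such a grid could be wired to a pipeline. The step names `vector`, `normalize` and `model` come from the grid keys; the classes chosen for each step (`TfidfVectorizer`, `Normalizer`, `MLPClassifier`) and the `max_iter` value are assumptions. The search is only constructed here, not fitted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

# step names must match the prefixes used in the grid keys ("vector__", "normalize__", "model__")
pipe = Pipeline(steps=[
    ("vector", TfidfVectorizer()),           # assumed vectorizer
    ("normalize", Normalizer()),             # assumed normalization step
    ("model", MLPClassifier(max_iter=300)),  # assumed max_iter
])

param_grid = {
    "vector__min_df": [0.027, 0.03, 0.033],
    "vector__max_df": [0.9, 0.95],
    "vector__ngram_range": [(1, 5)],
    "vector__analyzer": ["char", "word"],
    "normalize__norm": ["l1", "l2"],
    "model__hidden_layer_sizes": [(16, 8), (32, 16)],
}

# sanity check: every grid key should be a real parameter of the pipeline
unknown_keys = [key for key in param_grid if key not in pipe.get_params()]

search = GridSearchCV(pipe, param_grid, cv=10, n_jobs=-1,
                      scoring=make_scorer(f1_score, average="macro"))
# search.fit(texts, labels) would then run the full search on raw text and labels
```

Checking the keys before fitting catches a misspelled step or parameter name before the long search starts.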

**KNN model**

Limited max_features to 1000.

Normalized with the l2 norm.


# Copy for testing

# **KNN model**

min_df=0.1 (ignore terms that appear in less than 10% of the stories)

max_df=0.6 (ignore terms that appear in more than 60% of the stories)

ngram_range=(1, 3)
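
A small sketch of how `min_df` and `max_df` prune the vocabulary. The four toy sentences and the thresholds 0.5 and 0.75 are chosen for illustration only (they are not the values used above), so that the effect is visible on a tiny corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy corpus: "the" is in 4/4 documents; "cat", "sat", "flew" in 2/4; "dog", "bird" in 1/4
docs = ["the cat sat", "the dog sat", "the bird flew", "the cat flew"]

# keep terms that appear in at least 50% and at most 75% of the documents
vec = CountVectorizer(min_df=0.5, max_df=0.75)
vec.fit(docs)
kept_terms = sorted(vec.vocabulary_)  # → ['cat', 'flew', 'sat']
```

Here "the" is dropped as too common and "dog" and "bird" as too rare.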

# **KNN model**
So far this has the best score: 0.6123324178656229

analyzer=char

min_df=0.1 (ignore terms that appear in less than 10% of the stories)

max_df=0.6 (ignore terms that appear in more than 60% of the stories)

ngram_range=(3, 8) (character n-grams that are 3 to 8 characters long)

stop_words=" " (space) (adding this stop word doesn't change the score in this case; I kept it)

max_features=10000 (irrelevant, since this setting yields ~8k features; I will keep it anyway)
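
To make the character n-gram setting concrete, here is a minimal sketch that lists the features produced for a single word. The range `(3, 4)` and the word "hello" are illustrative assumptions, chosen to keep the output short.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function the vectorizer uses to split text into features
analyzer = CountVectorizer(analyzer="char", ngram_range=(3, 4)).build_analyzer()
grams = analyzer("hello")  # → ['hel', 'ell', 'llo', 'hell', 'ello']
```

With `analyzer="char"` the n-grams are sliding windows over characters, so they can span the spaces between words.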

# **KNN model**

min_df=0.1 (ignore terms that appear in less than 10% of the stories)

max_df=0.6 (ignore terms that appear in more than 60% of the stories)

ngram_range=(1, 3)

# **Gaussian Naive Bayes**

**char**

ngram_range=(3, 6)

analyzer=char: we are analysing character n-grams that are 3 to 6 characters long

min_df=0.05: a feature must be present in at least 5% of the stories

max_df=0.7: a feature must not be present in more than 70% of the stories

stop_words=" " (space): this time it does matter
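
A hedged sketch of feeding character n-gram features to Gaussian Naive Bayes. The toy sentences and labels are assumptions, and `min_df`/`max_df` are left at their defaults so the tiny corpus is not pruned to nothing. `GaussianNB` expects a dense array, so the sparse matrix from the vectorizer is converted first.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# toy stories and labels, for illustration only
stories = ["he went to the game", "she went to the store",
           "he played the game", "she bought at the store"]
labels = [1, 0, 1, 0]

vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 6))
X_sparse = vec.fit_transform(stories)
X_dense = X_sparse.toarray()  # GaussianNB does not accept sparse input

clf = GaussianNB().fit(X_dense, labels)
predictions = clf.predict(X_dense)
```

On the full dataset the dense conversion can use a lot of memory, which is one reason to limit the number of features.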


# **Gaussian Naive Bayes**

ngram_range=(2, 4)

analyzer=word: we are analysing word n-grams that are 2 to 4 words long

min_df=0.1: a feature must be present in at least 10% of the stories

max_df=0.8: a feature must not be present in more than 80% of the stories


# **Decision tree**
Takes a very long time to process and gets a score of ~0.6,
so I won't run many tests for it.

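# SGDClassifier

It fits a linear model, so it is mostly associated with linear problems, but it also works for categorical target values (classification).
It uses stochastic gradient descent: the weights and the intercept are updated one sample at a time, in the direction that reduces the loss (the residual error), until the loss stops improving.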
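
A minimal sketch of the idea behind stochastic gradient descent with the perceptron loss, written in plain NumPy. This is a simplified illustration, not scikit-learn's actual implementation; the function name `sgd_perceptron`, the toy data, and the fixed learning rate are assumptions.

```python
import numpy as np

def sgd_perceptron(X, y, epochs=10, lr=1.0):
    """Fit weights and an intercept one sample at a time (labels must be -1 or +1)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # perceptron loss is max(0, -yi * score), so only mistakes trigger an update
            if yi * (xi @ w + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
    return w, b

# toy one-feature data: negative values belong to class -1, positive values to class +1
X_toy = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_toy = np.array([-1, -1, 1, 1])

w, b = sgd_perceptron(X_toy, y_toy)
pred = np.where(X_toy @ w + b > 0, 1, -1)
```

On this toy data a single update is enough to separate the two classes, and later passes make no further changes.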



# SVC

# Logistic regression

# Forest

# Neural network

KNeighborsClassifier

Gaussian Naive Bayes

SGDClassifier

DecisionTreeClassifier

SVC

LogisticRegression

MLPClassifier

LinearSVC

TODO:
* use `best_score = cross_val_score(model, X_train_normalized, df_train["gender"], cv=10, scoring=make_scorer(f1_score, average='macro'))` for every model
* add GridSearchCV
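
A hedged sketch of the scoring call from the first TODO item, run on synthetic data so that it is self-contained. The data from `make_classification` and the choice of `LogisticRegression` are assumptions for illustration; in the notebook the inputs would be the vectorized stories and the gender column.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the vectorized stories and their labels
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

macro_f1 = make_scorer(f1_score, average="macro")  # same scorer for every model
scores = cross_val_score(LogisticRegression(max_iter=1000), X_toy, y_toy,
                         cv=10, scoring=macro_f1)
mean_score = scores.mean()  # one number per model, comparable across models
```

Using the same scorer and the same `cv` for every model keeps the reported scores comparable.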


### Save output to csv (optional)
After you're done, save your output to the 'classification_results.csv' file.<br/>
We assume that the dataframe with your results contains the following columns:
* column 1 (left column): 'test_example_id' - the same id associated with each of the test stories to be predicted.
* column 2 (right column): 'predicted_category' - the predicted gender value for each associated story.

Assuming your predicted values are in the `df_predicted` dataframe, you should save your results as follows:

In [46]:
df_predicted.to_csv('classification_results.csv',index=False)
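
A small optional sanity check, sketched on a toy dataframe written to a temporary folder so that it does not touch the real results file. The toy values and the names `df_toy`, `path` and `df_check` are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# toy stand-in for df_predicted, with the two expected columns
df_toy = pd.DataFrame({"test_example_id": [0, 1, 2],
                       "predicted_category": ["m", "f", "m"]})

path = os.path.join(tempfile.mkdtemp(), "classification_results.csv")
df_toy.to_csv(path, index=False)

# read the file back and confirm the column names and label values survived
df_check = pd.read_csv(path)
columns_ok = list(df_check.columns) == ["test_example_id", "predicted_category"]
labels_ok = set(df_check["predicted_category"]) <= {"m", "f"}
```

The same two checks can be applied to the real file after saving it.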