# Assignment 3 - Text Analysis
An explanation of this assignment can be found in the accompanying .pdf document.


## Materials to review for this assignment
<h4>From Moodle:</h4> 
<h5><u>Review the notebooks regarding the following Python topics</u>:</h5>
<div class="alert alert-info">
&#x2714; <b>Working with strings</b> (tutorial notebook)<br/>
&#x2714; <b>Text Analysis</b> (tutorial notebook)<br/>
&#x2714; <b>Hebrew text analysis tools (tokenizer, WordNet)</b> (Moodle example)<br/>
&#x2714; <b>(brief review) All previous notebooks</b><br/>
</div> 
<h5><u>Review the presentations regarding the following topics</u>:</h5>
<div class="alert alert-info">
&#x2714; <b>Text Analysis</b> (lecture presentation)<br/>
&#x2714; <b>(brief review) All other presentations</b><br/>
</div>

## Personal Details:

In [None]:
# Details Student 1: Hadar Asher, 207767005, hadarasher99@gmail.com
# Details Student 2: Or Milles, 322721663

## Preceding Step - import modules (packages)
This step is necessary in order to use external modules (packages). <br/>

In [None]:
# --------------------------------------
import pandas as pd
import numpy as np
# --------------------------------------


# --------------------------------------
# ------------- visualizations:
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
# --------------------------------------


# ---------------------------------------
import sklearn
from sklearn import preprocessing, metrics, pipeline, model_selection, feature_extraction 
from sklearn import naive_bayes, linear_model, svm, neural_network, neighbors, tree
from sklearn import decomposition, cluster

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV 
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import mean_squared_error, r2_score, silhouette_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder

from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import Perceptron, SGDClassifier
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
# ---------------------------------------


# ----------------- output and visualizations: 
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.simplefilter("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter("ignore", category=ConvergenceWarning)
# show several outputs in one cell. This allows us to condense several steps into one cell.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%matplotlib inline
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# ---------------------------------------

### Text analysis and String manipulation imports:

In [None]:
# --------------------------------------
# --------- Text analysis and Hebrew text analysis imports:
# vectorizers:
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# regular expressions:
import re
# --------------------------------------

### (optional) Hebrew text analysis - WordNet (for Hebrew)
Note: WordNet is optional.

#### (optional) Only if you have not installed WordNet (for Hebrew) yet, use:

In [None]:
# word net installation:

# run this cell only if you want to use WordNet and still need to install it
!pip install wn
!python -m wn download omw-he:1.4

In [None]:
# word net import:

# run this cell only if you want to use WordNet:
import wn

### (optional) Hebrew text analysis - hebrew_tokenizer (Tokenizer for Hebrew)
Note: hebrew_tokenizer is optional.

#### (optional) Only if you have not installed hebrew_tokenizer yet, use:

In [None]:
# Hebrew tokenizer installation:

# run this cell only if you want to use hebrew_tokenizer and still need to install it:
!pip install hebrew_tokenizer

In [None]:
# Hebrew tokenizer import:

# run this cell only if you want to use hebrew_tokenizer:
import hebrew_tokenizer as ht

### Reading input files
Read the input files: the annotated training corpus (raw text data) and the test corpus.

In [None]:
train_filename = 'annotated_corpus_for_train.csv'
test_filename  = 'corpus_for_test.csv'
df_train = pd.read_csv(train_filename, index_col=None, encoding='utf-8')
df_test  = pd.read_csv(test_filename, index_col=None, encoding='utf-8')

In [None]:
df_train.head(3)
df_train.shape

In [None]:
df_test.head(3)
df_test.shape

### Your implementation:
Write your solution in the following code cells.

### Data Cleansing

In [None]:
df_train.describe(include='all')

There is no missing data.<br/>
We can see that there are a few duplicates. We'll find and drop them.
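
Before dropping anything, it can help to look at the duplicated rows themselves. The sketch below uses a small toy frame (the rows are made up; only the column names `story` and `gender` come from this notebook) to show one way to count missing values, inspect every copy of a duplicated row, and drop the repeats.

```python
import pandas as pd

# Toy frame standing in for df_train; the rows are invented for illustration.
toy = pd.DataFrame({
    "story": ["סיפור ראשון", "סיפור שני", "סיפור ראשון"],
    "gender": ["m", "f", "m"],
})

# Count missing values per column.
missing_per_column = toy.isna().sum()

# keep=False flags every copy of a duplicated row, which makes inspection easier.
duplicate_rows = toy[toy.duplicated(keep=False)]

# Drop the repeats (the first copy is kept) and rebuild a clean 0..n-1 index.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
```

Resetting the index is optional; it only keeps row labels contiguous after rows are removed.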



In [None]:
df_train.duplicated().sum()

In [None]:
df_train.drop_duplicates(inplace=True)

In [None]:
df_train.info()

In [None]:
df_train.dtypes

<b>Feature Engineering -</b><br/>
Convert the 'gender' column from string values to numeric codes (m = 0, f = 1).
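
As a possible alternative to `replace`, the encoding can be written with `map` plus a check that no unexpected label slipped through. This is only a sketch: `encode_gender` and `GENDER_CODES` are hypothetical names, and the labels in the example are made up.

```python
import pandas as pd

GENDER_CODES = {"m": 0, "f": 1}  # same encoding as the notebook uses

def encode_gender(labels: pd.Series) -> pd.Series:
    """Map 'm'/'f' to 0/1 and fail loudly on any unexpected label."""
    encoded = labels.map(GENDER_CODES)
    if encoded.isna().any():
        unknown = sorted(labels[encoded.isna()].astype(str).unique())
        raise ValueError(f"unexpected gender labels: {unknown}")
    return encoded.astype(int)

toy_labels = pd.Series(["m", "f", "f", "m"])
encoded_labels = encode_gender(toy_labels)
```

Unlike `replace`, `map` turns unmapped values into NaN, so a typo in the data shows up as an error instead of silently surviving as a string.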

In [None]:
df_train['gender']=df_train['gender'].replace({'m':0,'f':1})
df_train.head(3)

## Visualizations
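
A bar plot of the class counts is one simple way to check whether the two genders are balanced. The following is a minimal sketch using matplotlib; `plot_class_counts` is a hypothetical helper and the labels are toy values, assuming the 0/1 encoding from the previous step.

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_class_counts(labels: pd.Series, title: str = "Stories per gender"):
    """Draw a bar plot of how many rows fall in each class and return the Axes."""
    counts = labels.value_counts().sort_index()
    fig, ax = plt.subplots(figsize=(4, 3))
    ax.bar(counts.index.astype(str), counts.values)
    ax.set_xlabel("gender")
    ax.set_ylabel("number of stories")
    ax.set_title(title)
    return ax

# Toy labels; in the notebook this would be df_train["gender"].
toy_labels = pd.Series([0, 0, 0, 1])
ax = plot_class_counts(toy_labels)
```

A strong imbalance here would be one reason to prefer `f1_macro` over plain accuracy when scoring models.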

In [None]:
# barplot of male and female statistics

## Text Analysis

Start by cleaning the text, keeping only Hebrew characters.

In [None]:
def keep_hebrew(text):
    return re.sub(r'[^א-ת\s]', '', text)

In [None]:
def tokenize_hebrew(text):
    tokens = ht.tokenize(text)
    token_list = []
    for token in tokens:
        token_list.append(token[1])
    return token_list

Tokenization is done with hebrew_tokenizer: `tokenize_hebrew` collects the token strings that the tokenizer yields.
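
To see what the cleaning step does to mixed text, here is a small standalone sketch. It repeats the same regular expression as `keep_hebrew` and pairs it with `simple_tokenize`, a hypothetical whitespace tokenizer that stands in for hebrew_tokenizer so the example runs without extra installs.

```python
import re
from typing import List

# Matches every character that is neither a Hebrew letter (alef to tav) nor whitespace.
NON_HEBREW = re.compile(r"[^א-ת\s]")

def keep_hebrew(text: str) -> str:
    """Remove every character that is not a Hebrew letter or whitespace."""
    return NON_HEBREW.sub("", text)

def simple_tokenize(text: str) -> List[str]:
    """Hypothetical fallback tokenizer: clean the text, then split on whitespace."""
    return keep_hebrew(text).split()

# The sample is built from pieces so the logical character order is unambiguous.
sample = "שלום" + " world, 123 " + "עולם" + "!"
tokens = simple_tokenize(sample)
```

Latin letters, digits and punctuation are removed, so only the two Hebrew words remain as tokens.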

### Text Vectorization
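
The pipelines in this section all start with a `TfidfVectorizer` that receives a custom tokenizer. The sketch below shows that step in isolation on a made-up three-sentence corpus; `whitespace_tokenizer` is a hypothetical stand-in for `tokenize_hebrew`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def whitespace_tokenizer(text):
    """Stand-in for tokenize_hebrew so the sketch runs without hebrew_tokenizer."""
    return text.split()

# Invented sentences, only for illustration.
toy_corpus = ["הלכתי לים עם חברים", "הלכתי לעבודה", "נסעתי לים"]

# token_pattern=None silences the warning that the default pattern is unused
# when a custom tokenizer is supplied.
vectorizer = TfidfVectorizer(tokenizer=whitespace_tokenizer, token_pattern=None)
matrix = vectorizer.fit_transform(toy_corpus)

# One row per document, one column per distinct token.
vocabulary = sorted(vectorizer.vocabulary_)
```

Each row of `matrix` is L2-normalized by default, which is why the later pipelines can feed it directly to linear models.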

In [None]:
from sklearn.naive_bayes import MultinomialNB

pipeline_mnb = Pipeline([
    ('tfidf', TfidfVectorizer(tokenizer=tokenize_hebrew)),
    ('clf', MultinomialNB())
])

param_grid_mnb = {
    'tfidf__max_features': [1000, 5000, 10000],
    'tfidf__use_idf': [True, False],
    'tfidf__smooth_idf': [True, False],
    'clf__alpha': [0.01, 0.1, 1.0, 10.0],
    'clf__fit_prior': [True, False]
}

grid_search_mnb = GridSearchCV(pipeline_mnb, param_grid_mnb, cv=10, scoring='f1_macro')
grid_search_mnb.fit(df_train["story"], df_train["gender"])

print("Best Parameters (MultinomialNB):", grid_search_mnb.best_params_)
print("Best F1 Macro Score (MultinomialNB):", grid_search_mnb.best_score_)



In [None]:
from sklearn.linear_model import Perceptron

pipeline_perceptron = Pipeline([
    ('tfidf', TfidfVectorizer(tokenizer=tokenize_hebrew)),
    ('scaler', StandardScaler(with_mean=False)),
    ('clf', Perceptron())
])

param_grid_perceptron = {
    'tfidf__max_features': [1000, 5000, 10000],
    'tfidf__use_idf': [True, False],
    'tfidf__smooth_idf': [True, False],
    'clf__alpha': [0.0001, 0.001, 0.01],
    'clf__penalty': [None, 'l1', 'l2', 'elasticnet'],
    'clf__class_weight': [None, 'balanced'],
    'clf__max_iter': [100, 1000]
}

grid_search_perceptron = GridSearchCV(pipeline_perceptron, param_grid_perceptron, cv=10, scoring='f1_macro')
grid_search_perceptron.fit(df_train["story"], df_train["gender"])

print("Best Parameters (Perceptron):", grid_search_perceptron.best_params_)
print("Best F1 Macro Score (Perceptron):", grid_search_perceptron.best_score_)


In [None]:
from sklearn.linear_model import SGDClassifier

pipeline_sgd = Pipeline([
    ('tfidf', TfidfVectorizer(tokenizer=tokenize_hebrew)),
    ('scaler', StandardScaler(with_mean=False)),
    ('clf', SGDClassifier())
])

param_grid_sgd = {
    'tfidf__max_features': [1000, 5000, 10000],
    'tfidf__use_idf': [True, False],
    'tfidf__smooth_idf': [True, False],
    'clf__alpha': [0.0001, 0.001, 0.01],
    'clf__penalty': ['l2', 'l1', 'elasticnet'],
    'clf__loss': ['hinge', 'log_loss', 'modified_huber'],  # 'log' was renamed to 'log_loss' in scikit-learn 1.1
    'clf__class_weight': [None, 'balanced'],
    'clf__max_iter': [100, 1000]
}

grid_search_sgd = GridSearchCV(pipeline_sgd, param_grid_sgd, cv=10, scoring='f1_macro')
grid_search_sgd.fit(df_train["story"], df_train["gender"])

print("Best Parameters (SGDClassifier):", grid_search_sgd.best_params_)
print("Best F1 Macro Score (SGDClassifier):", grid_search_sgd.best_score_)




In [None]:
from sklearn.neighbors import KNeighborsClassifier

pipeline_knn = Pipeline([
    ('tfidf', TfidfVectorizer(tokenizer=tokenize_hebrew)),
    ('scaler', StandardScaler(with_mean=False)),
    ('clf', KNeighborsClassifier())
])

param_grid_knn = {
    'tfidf__max_features': [1000, 5000, 10000],
    'tfidf__use_idf': [True, False],
    'tfidf__smooth_idf': [True, False],
    'clf__n_neighbors': [3, 5, 7],
    'clf__weights': ['uniform', 'distance'],
    'clf__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'clf__p': [1, 2]
}

grid_search_knn = GridSearchCV(pipeline_knn, param_grid_knn, cv=10, scoring='f1_macro')
grid_search_knn.fit(df_train["story"], df_train["gender"])

print("Best Parameters (KNeighborsClassifier):", grid_search_knn.best_params_)
print("Best F1 Macro Score (KNeighborsClassifier):", grid_search_knn.best_score_)



In [None]:
from sklearn.tree import DecisionTreeClassifier

pipeline_dt = Pipeline([
    ('tfidf', TfidfVectorizer(tokenizer=tokenize_hebrew)),
    ('clf', DecisionTreeClassifier())
])

param_grid_dt = {
    'tfidf__max_features': [1000, 5000, 10000],
    'tfidf__use_idf': [True, False],
    'tfidf__smooth_idf': [True, False],
    'clf__criterion': ['gini', 'entropy'],
    'clf__max_depth': [None, 10, 20],
    'clf__min_samples_split': [2, 5, 10]
}

grid_search_dt = GridSearchCV(pipeline_dt, param_grid_dt, cv=10, scoring='f1_macro')
grid_search_dt.fit(df_train["story"], df_train["gender"])

print("Best Parameters (DecisionTreeClassifier):", grid_search_dt.best_params_)
print("Best F1 Macro Score (DecisionTreeClassifier):", grid_search_dt.best_score_)



In [None]:
from sklearn.svm import LinearSVC

pipeline_svc = Pipeline([
    ('tfidf', TfidfVectorizer(tokenizer=tokenize_hebrew)),
    ('scaler', StandardScaler(with_mean=False)),
    ('clf', LinearSVC())
])

param_grid_svc = {
    'tfidf__max_features': [1000, 5000, 10000],
    'tfidf__use_idf': [True, False],
    'tfidf__smooth_idf': [True, False],
    'clf__C': [0.1, 1.0, 10.0],
    'clf__class_weight': [None, 'balanced'],
    'clf__max_iter': [100, 1000]
}

grid_search_svc = GridSearchCV(pipeline_svc, param_grid_svc, cv=10, scoring='f1_macro')
grid_search_svc.fit(df_train["story"], df_train["gender"])

print("Best Parameters (LinearSVC):", grid_search_svc.best_params_)
print("Best F1 Macro Score (LinearSVC):", grid_search_svc.best_score_)



### Train Model
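
Once the grid searches above have been fitted, one possible way to choose the model to train is to compare their `best_score_` values. The sketch below uses placeholder objects with made-up scores so that it runs on its own; `pick_best_search` is a hypothetical helper, and in the notebook the dictionary would hold the fitted `GridSearchCV` objects instead.

```python
from types import SimpleNamespace

def pick_best_search(searches):
    """Return (name, search) for the entry with the highest best_score_."""
    return max(searches.items(), key=lambda item: item[1].best_score_)

# Placeholders with invented scores, NOT real results. In the notebook this
# could be {"mnb": grid_search_mnb, "svc": grid_search_svc, ...}.
fake_searches = {
    "mnb": SimpleNamespace(best_score_=0.61),
    "svc": SimpleNamespace(best_score_=0.68),
}
best_name, best_search = pick_best_search(fake_searches)
```

With the default `refit=True`, the winning search's `best_estimator_` is already refitted on the full training data and can be used for prediction directly.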

#### Split the train set into X and y

#### Use cross-validation
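
A minimal sketch of these two steps on toy data: the text goes into X, the encoded labels into y, and a pipeline is scored with stratified k-fold cross-validation. The sentences and labels are made up, `whitespace_tokenizer` is a hypothetical stand-in for `tokenize_hebrew`, and 3 folds are used only because the toy set is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def whitespace_tokenizer(text):
    """Stand-in for tokenize_hebrew so the sketch runs without hebrew_tokenizer."""
    return text.split()

# Toy data, repeated so each class has enough rows for 3 stratified folds.
X = ["הלכתי לים", "הלכתי לעבודה", "נסעתי לים", "נסעתי לעבודה"] * 3
y = [0, 1, 0, 1] * 3

model = Pipeline([
    ("tfidf", TfidfVectorizer(tokenizer=whitespace_tokenizer, token_pattern=None)),
    ("clf", MultinomialNB()),
])

# Stratified folds keep the class ratio similar in every fold; fitting the
# vectorizer inside the pipeline keeps validation text out of the vocabulary.
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=folds, scoring="f1_macro")
mean_score = scores.mean()
```

Passing the whole pipeline to `cross_val_score`, rather than a pre-vectorized matrix, is the design choice that avoids leaking information from the validation folds.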

### Save output to csv (optional)
When you are done, save your output to the 'classification_results.csv' file.<br/>
We assume that the dataframe with your results contains the following columns:
* column 1 (left column): 'test_example_id'  - the same id associated with each of the test stories to be predicted.
* column 2 (right column): 'predicted_category' - the predicted gender value for the associated story.
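
One way to assemble a dataframe with these two columns is sketched below. `build_predictions_frame` is a hypothetical helper, the ids and predictions are made up, and mapping the numeric codes back to 'm'/'f' is an assumption about the expected output format.

```python
import pandas as pd

def build_predictions_frame(test_ids, predictions):
    """Pair each test id with its predicted label in the expected column order."""
    # Assumption: results should use the original 'm'/'f' labels, so the 0/1
    # codes used for training are mapped back here.
    code_to_label = {0: "m", 1: "f"}
    return pd.DataFrame({
        "test_example_id": list(test_ids),
        "predicted_category": [code_to_label[int(p)] for p in predictions],
    })

# Made-up ids and predictions, only to show the shape of the result.
toy_predicted = build_predictions_frame([0, 1, 2], [1, 0, 1])
```

In the notebook, the ids would come from the test dataframe and the predictions from the chosen model's `predict` call on the test stories.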

Assuming your predicted values are in the `df_predicted` dataframe, you should save your results as follows:

In [None]:
df_predicted.to_csv('classification_results.csv',index=False)