# Preliminary

This notebook gathers statistical analyses of the descriptions before and after cleaning. We used it to guide our choice of classification strategy. <br>
**Make sure you have downloaded the datasets required for the challenge and set DATA_PATH accordingly; also make sure to put them all in the same folder.**

# Library

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import collections

# Plotly
import plotly
import plotly.express as px
import plotly.graph_objects as go

# Seaborn
import seaborn as sb
sb.set(color_codes=True)

In [None]:
import os
os.getcwd()

# Data

In [None]:
# To fill in. Advice: set DATA_PATH to the folder in which you saved this notebook
# and save the datasets in that same folder, so the code runs without problems.
DATA_PATH = 


train_df = pd.read_json(DATA_PATH+"/train.json")
train_df.set_index('Id', inplace=True) 

test_df = pd.read_json(DATA_PATH+"/test.json")
test_df.set_index('Id', inplace=True) 


train_label = pd.read_csv(DATA_PATH+"/train_label.csv")
train_label.set_index('Id', inplace=True)

categ = pd.read_csv(DATA_PATH+'/categories_string.csv')

template_submissions = pd.read_csv(DATA_PATH + "/template_submissions.csv")

In [None]:
pd.options.display.max_colwidth = 1000
display(train_df.head(5))

In [None]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(train_df.head(5))

In [None]:
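# 'Id' is the index of both frames, so restore it as a column and merge on it explicitly
# (a bare pd.merge needs a shared column, and 'Id' is used as a column further down)
train_tot = pd.merge(train_df.reset_index(), train_label.reset_index(), on="Id")
# Attach the category names through the shared 'Category' column
train_tot = pd.merge(train_tot, categ, on="Category")
train_tot.head(5)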

In [None]:
test_df.head(3)

In [None]:
template_submissions.head(3)

# Analysis of the training set

In [None]:
print("Number of jobs in the dataset : %d" %(len(train_tot.Category.unique())))
print("Number of people in the dataset : %d" %(len(train_tot.Id.unique())))

### Number of people per job

In [None]:
fig = px.histogram(train_tot, x="Category name", color="gender")
fig.show()

### Number of people per under-represented job

In [None]:
train_gb_category=train_tot.groupby('Category')

In [None]:
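# Keep the categories with fewer than 10,000 samples; indexing the group sizes by label
# avoids relying on categ's row order matching the groupby order
category_sizes = train_gb_category.size()
categ_kept = category_sizes[category_sizes < 10000].index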

In [None]:
fig = px.histogram(train_tot[train_tot['Category'].isin(categ_kept)], x="Category name", color="gender")
fig.show()

Jobs are unevenly represented: there are, for example, many professors but few rappers. Gender is also very unevenly distributed across the different jobs.
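The histograms show both imbalances visually; they can also be put into numbers. The sketch below is only an illustration on a small invented frame (`toy` and its values are assumptions, not the challenge data), reusing the `Category name` and `gender` column names from `train_tot`.

```python
import pandas as pd

# Toy stand-in for train_tot; the rows are invented for illustration only.
toy = pd.DataFrame({
    "Category name": ["professor", "professor", "professor", "professor",
                      "rapper", "nurse", "nurse"],
    "gender": ["M", "M", "M", "F", "M", "F", "F"],
})

# Share of the dataset held by each job (class imbalance).
job_share = toy["Category name"].value_counts(normalize=True)

# Gender split inside each job: every row sums to 1.
gender_by_job = pd.crosstab(toy["Category name"], toy["gender"], normalize="index")

print(job_share)
print(gender_by_job)
```

Replacing `toy` with `train_tot` should give the same two tables for the real data, assuming the merged frame has these two columns.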

## Study of the descriptions

In [None]:
# Count words per description (len() on a string would count characters instead)
length_descriptions = train_tot.description.str.split().str.len()
print("Longest description:", length_descriptions.max(), "words")

In [None]:
train_tot.head(10)

In [None]:
px.box(length_descriptions)

In [None]:
from wordcloud import WordCloud
all_descr = " ".join(train_tot.description.values)
wordcloud_word = WordCloud(background_color="black", collocations=False).generate_from_text(all_descr)

In [None]:
plt.figure(figsize=(10,10))
plt.imshow(wordcloud_word,cmap=plt.cm.Paired)
plt.axis("off")
plt.show()

## Analysis of the cleaned descriptions

In [None]:
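# CleanText is the project's own text-cleaning class; it must be defined or imported before this cell runs
DATA_CLEANED_PATH = os.path.join(DATA_PATH, 'cleaned')
ct = CleanText(stemming=True, lem=False)
ct.clean_save(train_tot, 'train_tot', "description", "description_cleaned", DATA_CLEANED_PATH)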

In [None]:
train_tot_clean = pd.read_csv(os.path.join(DATA_CLEANED_PATH,'train_tot_cleaned_stem.csv'),index_col=0)

In [None]:
train_tot_clean.head(5)

Comparing the two columns shows the difference between the original description and the cleaned one; this cleaning will help the classification later on.

In [None]:
# Count words per cleaned description (len() on a string would count characters instead)
length_descriptions_clean = train_tot_clean.description_cleaned.str.split().str.len()
print("Longest cleaned description:", int(length_descriptions_clean.max()), "words")

In [None]:
px.box(length_descriptions_clean)

In [None]:
all_descr_clean = " ".join(train_tot_clean.description_cleaned.values)
wordcloud_word = WordCloud(background_color="black", collocations=False).generate_from_text(all_descr_clean)

plt.figure(figsize=(10,10))
plt.imshow(wordcloud_word,cmap=plt.cm.Paired)
plt.axis("off")
plt.show()

In [None]:
occurences = train_tot_clean.description_cleaned.str.split(expand=True).stack().value_counts()

In [None]:
len(occurences)

In [None]:
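# Column 0: an occurrence count; column 1: how many distinct words have exactly that count
cc = pd.DataFrame(collections.Counter(occurences.values).items())
# Show how many words appear at most 5 times
print(cc[cc[0] <= 5])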

In [None]:
nb = 500
px.bar(y=occurences.head(nb).values,x=occurences.head(nb).index, labels={'x': 'Words', 'y': 'Count'})

Word frequencies are clearly far from uniform: some words appear much more often than others.
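One way to quantify this skew is to measure how much of the text the most frequent words cover, and how many words are rare. The sketch below runs on a few invented stemmed descriptions; `toy_descriptions` and the helper `coverage` are hypothetical names introduced for the example, not part of the notebook.

```python
from collections import Counter

# Toy cleaned descriptions, invented for illustration only.
toy_descriptions = [
    "teach student univers",
    "teach research univers",
    "teach music student",
    "record rap album",
]

# Count every word over all descriptions.
word_counts = Counter(word for text in toy_descriptions for word in text.split())
total = sum(word_counts.values())


def coverage(counts, k):
    """Fraction of all word occurrences covered by the k most frequent words."""
    top = sum(count for _, count in counts.most_common(k))
    return top / sum(counts.values())


# Words that occur only once in the whole corpus.
rare_words = [word for word, count in word_counts.items() if count <= 1]

print("Vocabulary size:", len(word_counts))
print("Coverage of the top 3 words:", coverage(word_counts, 3))
print("Words seen only once:", len(rare_words))
```

On the real data, the same counter could be built from `train_tot_clean.description_cleaned`, assuming every entry is a string.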

**How can we deal with this imbalance?**

* At the classifier level:
    * Random Forest: class_weight = balanced
    * Boosting: scale_pos_weight = ...

* At the dataset level: duplicate the rows of under-represented jobs
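The two ideas above can be sketched with pandas alone. This is a minimal illustration on an invented frame, not the strategy actually retained. The weights follow the `n_samples / (n_classes * class_count)` heuristic that scikit-learn documents for `class_weight="balanced"`, and the duplication step is plain random oversampling; `toy`, `class_weights` and `balanced` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the training frame; values are invented for illustration only.
toy = pd.DataFrame({
    "description": ["a", "b", "c", "d", "e", "f"],
    "Category": [0, 0, 0, 0, 1, 1],
})

counts = toy["Category"].value_counts()

# Classifier level: "balanced" weights, n_samples / (n_classes * class_count).
class_weights = len(toy) / (len(counts) * counts)

# Dataset level: duplicate rows of each under-represented class
# until it reaches the size of the largest class.
target = counts.max()
parts = []
for _, group in toy.groupby("Category"):
    deficit = target - len(group)
    if deficit > 0:
        # Draw the missing rows at random, with replacement, from the same class.
        extra = group.sample(n=deficit, replace=True, random_state=0)
        group = pd.concat([group, extra])
    parts.append(group)
balanced = pd.concat(parts, ignore_index=True)

print(class_weights)
print(balanced["Category"].value_counts())
```

If rows are duplicated this way, it is safer to do it on the training split only, after setting the validation data aside, so that copies of the same row do not end up on both sides.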