# Text representation for better defined categories

An analysis in the text_representation notebook showed that the 'actualites' category is very hard to distinguish from the other categories, as it can contain news of many types. We therefore define our categories more precisely before transforming the texts into vector representations. In particular, we remove this vague category.

### Reading the files and importing the libraries

In [2]:
import numpy as np
import pandas as pd

In [3]:
pd.set_option('display.max_rows', 100)

In [30]:
path = '/Users/louispht/Dropbox/git_projects/news_classifier/Data cleaning/Csv_files_cleaned/'
df_all = pd.read_csv(path + 'stem_all.csv', index_col = 0, lineterminator='\n').reset_index(drop=True)

In [31]:
df_train = pd.read_csv(path + 'stem_train.csv', index_col = 0, lineterminator='\n').reset_index(drop=True)

In [32]:
df_test = pd.read_csv(path +'stem_test.csv', index_col = 0, lineterminator='\n').reset_index(drop=True)

In [9]:
df_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13807 entries, 0 to 13806
Data columns (total 6 columns):
title              13807 non-null object
content            13807 non-null object
link               13807 non-null object
category           13807 non-null object
news_length        13807 non-null int64
cleaned_content    13807 non-null object
dtypes: int64(1), object(5)
memory usage: 647.3+ KB


In [10]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12335 entries, 0 to 12334
Data columns (total 6 columns):
title              12335 non-null object
content            12335 non-null object
link               12335 non-null object
category           12335 non-null object
news_length        12335 non-null int64
cleaned_content    12335 non-null object
dtypes: int64(1), object(5)
memory usage: 578.3+ KB


In [11]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1472 entries, 0 to 1471
Data columns (total 6 columns):
title              1472 non-null object
content            1472 non-null object
link               1472 non-null object
category           1472 non-null object
news_length        1472 non-null int64
cleaned_content    1472 non-null object
dtypes: int64(1), object(5)
memory usage: 69.1+ KB


In [9]:
df_all.head(5)

Unnamed: 0,title,content,link,category,news_length,cleaned_content
0,"Oui, Marie-Josée Lord est une chanteuse d'opéra!",«Je ne suis pas une chanteuse d'opéra.» Marie-...,https://www.lapresse.ca/arts/festivals/montrea...,culture,3137,chanteux oper mariejos lord laiss tomb c...
1,Une vallée sans foi ni loi,C'était une paisible vallée agricole où les po...,https://www.lapresse.ca/international/amerique...,international,7051,paisibl vall agricol où polici regl chican...
2,Les profs dénoncent la hausse artificielle des...,Les cibles de réussite que le ministère de l'É...,https://www.lapresse.ca/actualites/education/2...,actualites,5624,cibl réussit minister éduc impos depuis a...
3,18 millions pour l'ambassade et des résidences...,Bien avant que le gouvernement conservateur de...,https://www.lapresse.ca/actualites/dossiers/le...,actualites,3431,bien avant gouvern conserv stephen harp co...
4,La semaine s'annonce explosive à l'Hôtel de Ville,"Le vérificateur remettra au conseil municipal,...",https://www.lapresse.ca/actualites/grand-montr...,actualites,2983,vérif remettr conseil municipal soir rappo...


### Removing 'actualites'

As announced, we remove that category from our data set. 

In [33]:
df_all = df_all[df_all.category != 'actualites'].reset_index()
df_train = df_train[df_train.category != 'actualites'].reset_index()
df_test = df_test[df_test.category != 'actualites'].reset_index()
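
A side note on the cell above: `reset_index()` without `drop=True` keeps the old index as a new `index` column, which is why the `info()` outputs that follow list 7 columns instead of 6. A minimal sketch on a toy frame (the data is made up for illustration):

```python
import pandas as pd

# Toy frame standing in for df_all (illustrative values only)
toy = pd.DataFrame({'category': ['culture', 'actualites', 'sports']})

# Same filter as in the notebook
kept = toy[toy.category != 'actualites']

# Old index is preserved as a column named 'index'
with_index = kept.reset_index()

# Old index is discarded
without_index = kept.reset_index(drop=True)

print(list(with_index.columns))
print(list(without_index.columns))
```

If the original row positions are not needed later, `drop=True` avoids carrying the extra column through the rest of the pipeline.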

In [34]:
df_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9771 entries, 0 to 9770
Data columns (total 7 columns):
index              9771 non-null int64
title              9771 non-null object
content            9771 non-null object
link               9771 non-null object
category           9771 non-null object
news_length        9771 non-null int64
cleaned_content    9771 non-null object
dtypes: int64(2), object(5)
memory usage: 534.5+ KB


In [35]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8641 entries, 0 to 8640
Data columns (total 7 columns):
index              8641 non-null int64
title              8641 non-null object
content            8641 non-null object
link               8641 non-null object
category           8641 non-null object
news_length        8641 non-null int64
cleaned_content    8641 non-null object
dtypes: int64(2), object(5)
memory usage: 472.7+ KB


In [36]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1130 entries, 0 to 1129
Data columns (total 7 columns):
index              1130 non-null int64
title              1130 non-null object
content            1130 non-null object
link               1130 non-null object
category           1130 non-null object
news_length        1130 non-null int64
cleaned_content    1130 non-null object
dtypes: int64(2), object(5)
memory usage: 61.9+ KB


#### LabelEncoding

We encode the categories as integer labels, which are easier for our models to work with.
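
As a small illustration of what the encoding does, the sketch below fits a `LabelEncoder` on a toy list of categories (the list is made up; only the category names come from this notebook). The encoder stores the classes in sorted order, and the position of a class is its integer label.

```python
from sklearn.preprocessing import LabelEncoder

# Toy category column (illustrative values only)
toy_categories = ['sports', 'culture', 'affaires', 'international', 'culture']

le_toy = LabelEncoder()
le_toy.fit(toy_categories)

# Classes are stored in sorted order; the position of a class is its label
classes = list(le_toy.classes_)

# Category names -> integer labels
encoded = le_toy.transform(['sports', 'affaires']).tolist()

# Integer labels -> category names
decoded = list(le_toy.inverse_transform([1, 2]))

print(classes, encoded, decoded)
```

`inverse_transform` is handy later on for turning model predictions back into readable category names.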

In [10]:
#y_all = pd.get_dummies(df_all.category, prefix='cat')
#y_train = pd.get_dummies(df_train.category, prefix='cat')
#y_test = pd.get_dummies(df_test.category, prefix='cat')

In [37]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df_train.category)
y_all = le.transform(df_all.category)
y_train = le.transform(df_train.category)
y_test = le.transform(df_test.category)

In [38]:
y_all.shape

(9771,)

In [39]:
y_train.shape

(8641,)

In [40]:
y_test.shape

(1130,)

In [41]:
df_y_all = pd.DataFrame(data = y_all, columns = ['label_enc'])
df_y_train = pd.DataFrame(data = y_train, columns = ['label_enc'])
df_y_test = pd.DataFrame(data = y_test, columns = ['label_enc'])

In [42]:
df_y_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9771 entries, 0 to 9770
Data columns (total 1 columns):
label_enc    9771 non-null int64
dtypes: int64(1)
memory usage: 76.5 KB


In [43]:
df_y_all.head()

Unnamed: 0,label_enc
0,1
1,2
2,0
3,0
4,0


In [44]:
df_all['category'].head()

0          culture
1    international
2         affaires
3         affaires
4         affaires
Name: category, dtype: object

In [45]:
df_all_le = pd.concat([df_all, df_y_all], axis=1)
df_test_le = pd.concat([df_test, df_y_test], axis=1)
df_train_le = pd.concat([df_train, df_y_train], axis=1)

In [46]:
df_all_le.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9771 entries, 0 to 9770
Data columns (total 8 columns):
index              9771 non-null int64
title              9771 non-null object
content            9771 non-null object
link               9771 non-null object
category           9771 non-null object
news_length        9771 non-null int64
cleaned_content    9771 non-null object
label_enc          9771 non-null int64
dtypes: int64(3), object(5)
memory usage: 610.8+ KB


In [47]:
df_train_le.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8641 entries, 0 to 8640
Data columns (total 8 columns):
index              8641 non-null int64
title              8641 non-null object
content            8641 non-null object
link               8641 non-null object
category           8641 non-null object
news_length        8641 non-null int64
cleaned_content    8641 non-null object
label_enc          8641 non-null int64
dtypes: int64(3), object(5)
memory usage: 540.2+ KB


In [48]:
df_test_le.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1130 entries, 0 to 1129
Data columns (total 8 columns):
index              1130 non-null int64
title              1130 non-null object
content            1130 non-null object
link               1130 non-null object
category           1130 non-null object
news_length        1130 non-null int64
cleaned_content    1130 non-null object
label_enc          1130 non-null int64
dtypes: int64(3), object(5)
memory usage: 70.8+ KB


In [56]:
df_test_le.sample(10)

Unnamed: 0,index,title,content,link,category,news_length,cleaned_content,label_enc
724,1066,La pénurie est de retour,Le Québec a beau être confronté à un taux de c...,https://www.journaldemontreal.com/2020/05/31/c...,affaires,3930,québec beau être confront taux chômag his...,0
459,634,BTS se hisse au somment du Billboard avec Dyna...,(New York) Nouvelle performance pour le groupe...,https://www.lapresse.ca/arts/musique/2020-08-3...,culture,1214,new york nouvel perform group bt rois kpo...,1
171,346,Clins d’œil technologiques,"Arkangel AI, une entreprise émergente montréal...",https://www.lapresse.ca/affaires/techno/2020-0...,affaires,2285,arkangel entrepris émergent montréalais util...,0
729,1071,Quand les hôteliers se mettent en mode solution,Les besoins de recrutement sont de plus en plu...,https://www.journaldemontreal.com/2020/02/22/q...,affaires,6359,besoin recrut plus plus cri hôtel montr...,0
838,1180,Raonic s’incline devant Djokovic en finale,Milos Raonic a laissé filer une avance d’un se...,https://www.journaldemontreal.com/2020/08/29/t...,sports,1098,milos raonic laiss fil avanc set deux poi...,3
8,183,Le Soudan écarte devant Mike Pompeo une normal...,"(Khartoum, Soudan) Le Soudan a douché mardi le...",https://www.lapresse.ca/international/afrique/...,international,4510,khartoum soudan soudan douch mard espoir i...,2
249,424,Plus de 35 000$ à Enfant Soleil,"Grâce à l’appui de son vaste réseau, HGrégoire...",https://www.lapresse.ca/affaires/tetes-d-affic...,affaires,2955,grâc appui vast réseau hgrégoir rem 31 ju...,0
21,196,Le Brésil compte plus de 120 000 morts du coro...,(Rio de Janeiro) Six mois après avoir recensé ...,https://www.lapresse.ca/international/amerique...,international,4052,rio janeiro six mois apres avoir recens prem...,2
137,312,Bourse de croissance TSX en forte hausse: rega...,L’indice de la Bourse de croissance de Toronto...,https://www.lapresse.ca/affaires/marches/2020-...,affaires,6018,indic bours croissanc toronto plus doubl...,0
945,1287,Bernier se réjouit des récents succès de Davies,Ayant tout juste lancé un livre qui retrace so...,https://www.journaldemontreal.com/2020/08/31/p...,sports,3648,tout just lanc livr retrac parcour québéc...,3


#### Text representation

We use TF-IDF vectors as our features. We first set the parameters, which could be adjusted for better results. For example, too many features could cause overfitting.
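
To give a feel for how `min_df` and `max_df` prune the vocabulary, here is a minimal sketch on a four-document toy corpus. The corpus and the thresholds are assumptions chosen for illustration, not the values used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus of already-stemmed documents (made up for illustration)
toy_docs = [
    "match joueur equip",
    "match saison joueur",
    "match film album",
    "film musiqu match",
]

# No pruning: every token becomes a feature
full = TfidfVectorizer(lowercase=False)
full.fit(toy_docs)

# min_df=2 drops terms found in fewer than 2 documents;
# max_df=0.75 drops terms found in more than 75% of the documents
pruned = TfidfVectorizer(lowercase=False, min_df=2, max_df=0.75,
                         norm='l2', sublinear_tf=True)
X_pruned = pruned.fit_transform(toy_docs)

print(len(full.vocabulary_))      # vocabulary size without pruning
print(sorted(pruned.vocabulary_)) # terms that survive pruning
print(X_pruned.shape)
```

Here 'match' appears in every document and is removed by `max_df`, while terms that occur only once are removed by `min_df`.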

In [50]:
X_train = df_train['cleaned_content']
X_test = df_test['cleaned_content']

In [51]:
# Setting parameters 
ngram_range = (1,1)
min_df = 300
max_df = 0.75
max_features = 500

In [52]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [53]:
tfidf = TfidfVectorizer(encoding='utf-8',
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
                        norm='l2',
                        sublinear_tf=True)
                        
features_train = tfidf.fit_transform(X_train).toarray()
print(features_train.shape)

features_test = tfidf.transform(X_test).toarray()
print(features_test.shape)

(8641, 500)
(1130, 500)


We get the feature names. 

In [54]:
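# Note: get_feature_names() was removed in scikit-learn 1.2;
# on recent versions, use tfidf.get_feature_names_out() instead
feature_names = np.array(tfidf.get_feature_names())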

In [55]:
feature_names

array(['000', '10', '11', '12', '15', '18', '20', '2015', '2016', '25',
       '30', 'accord', 'accus', 'action', 'activ', 'actuel', 'administr',
       'affair', 'affirm', 'affront', 'afin', 'afp', 'agenc', 'aid',
       'ailleur', 'aim', 'ains', 'ajout', 'album', 'alor', 'américain',
       'an', 'ancien', 'annonc', 'anné', 'appel', 'apres', 'armé',
       'arriv', 'arrêt', 'artist', 'associ', 'assur', 'attaqu', 'aucun',
       'augment', 'auss', 'aut', 'autor', 'autr', 'avanc', 'avant',
       'avoir', 'avril', 'baiss', 'bas', 'beaucoup', 'bel', 'besoin',
       'bien', 'bless', 'bon', 'but', 'camp', 'campagn', 'canad',
       'canadien', 'capital', 'car', 'carri', 'cas', 'caus', 'cel',
       'celui', 'centr', 'certain', 'cet', 'ceux', 'chanc', 'chang',
       'chanson', 'chaqu', 'chef', 'chez', 'chin', 'choix', 'chos',
       'cinq', 'class', 'club', 'combat', 'comm', 'commenc', 'comment',
       'commun', 'communiqu', 'compagn', 'complet', 'compt', 'confirm',
       'connu', 'con

In [57]:
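from sklearn.feature_selection import chi2
# Integer labels assigned by the LabelEncoder above (classes sorted alphabetically)
categories = {'affaires':0, 'sports':3, 'international':2, 'culture':1}

for cat, label in sorted(categories.items()):
    # One-vs-rest chi2 test: this category against all the others
    features_chi2 = chi2(features_train, y_train == label)
    # Sort the features by increasing chi2 score
    indices = np.argsort(features_chi2[0])
    # On scikit-learn >= 1.0, use tfidf.get_feature_names_out() instead
    feature_names = np.array(tfidf.get_feature_names())[indices]
    # Keep unigrams only
    words = [v for v in feature_names if len(v.split(' ')) == 1]
    print("# '{}' category:".format(cat))
    print("  Most correlated words:\n. {}".format('\n. '.join(words[-10:])))
    print("")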

# 'affaires' category:
  Most correlated words:
. march
. emploi
. baiss
. dollar
. économ
. invest
. financi
. milliard
. hauss
. entrepris

# 'culture' category:
  Most correlated words:
. tourn
. piec
. the
. musiqu
. scen
. spectacl
. chanson
. artist
. album
. film

# 'international' category:
  Most correlated words:
. républicain
. donald
. tu
. président
. polic
. démocrat
. armé
. militair
. état
. trump

# 'sports' category:
  Most correlated words:
. entraîneur
. tournoi
. lnh
. victoir
. équip
. ligu
. but
. saison
. joueur
. match



The most correlated words make intuitive sense.
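
To make the ranking less of a black box, here is a minimal sketch of the chi2 score on a tiny hand-made count matrix. The feature names and counts are assumptions for illustration only. A feature concentrated in one class gets a high score, while a feature spread evenly across classes scores zero.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical feature names and counts (rows = documents)
names = np.array(['match', 'film', 'avoir'])
X_toy = np.array([
    [3, 1, 1],   # class 1
    [2, 0, 1],   # class 1
    [0, 2, 1],   # class 0
    [0, 1, 1],   # class 0
])
y_toy = np.array([1, 1, 0, 0])

# chi2 returns (scores, p-values); we only rank by score here
scores, _ = chi2(X_toy, y_toy)

# Same ranking logic as in the notebook: argsort is ascending,
# so the most discriminative feature comes last
ranked = names[np.argsort(scores)]
print(scores)
print(ranked)
```

In this toy case 'match' only occurs in class 1 and ranks first, whereas 'avoir' occurs equally in both classes and carries no signal.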

One could also consider bigrams and trigrams, which could add more specific features to each category.
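
As a rough sketch of what that would look like, setting `ngram_range=(1, 2)` makes the vectorizer emit bigrams alongside unigrams. The two toy documents below are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (illustrative only)
toy_docs = ["donald trump président", "président donald trump"]

# (1, 2) keeps unigrams and adds bigrams
vec = TfidfVectorizer(ngram_range=(1, 2), lowercase=False)
vec.fit(toy_docs)

terms = sorted(vec.vocabulary_)
bigrams = [t for t in terms if ' ' in t]

print(terms)
print(bigrams)
```

Two caveats if this were applied to the notebook: the chi2 cell filters on `len(v.split(' ')) == 1` and would therefore hide bigrams from the printed lists, and a threshold such as `min_df = 300` would probably need to be lowered, since a given bigram is rarer than its component words.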

#### Saving the files

We use Pickle to store the different variables. 

In [59]:
import pickle

In [60]:
# X_train
with open('Pickles/X_train_noact.pickle', 'wb') as output:
    pickle.dump(X_train, output)
    
# X_test    
with open('Pickles/X_test_noact.pickle', 'wb') as output:
    pickle.dump(X_test, output)
    
# y_train
with open('Pickles/y_train_noact.pickle', 'wb') as output:
    pickle.dump(y_train, output)
    
# y_test
with open('Pickles/y_test_noact.pickle', 'wb') as output:
    pickle.dump(y_test, output)
    
# df_all
with open('Pickles/df_all_noact.pickle', 'wb') as output:
    pickle.dump(df_all, output)
    
# df_train
with open('Pickles/df_train_noact.pickle', 'wb') as output:
    pickle.dump(df_train, output)
    
# df_test
with open('Pickles/df_test_noact.pickle', 'wb') as output:
    pickle.dump(df_test, output)

# df_all_le
with open('Pickles/df_all_le_noact.pickle', 'wb') as output:
    pickle.dump(df_all_le, output)
    
# df_train_le
with open('Pickles/df_train_le_noact.pickle', 'wb') as output:
    pickle.dump(df_train_le, output)
    
# df_test_le
with open('Pickles/df_test_le_noact.pickle', 'wb') as output:
    pickle.dump(df_test_le, output)
    
# features_train
with open('Pickles/features_train_noact.pickle', 'wb') as output:
    pickle.dump(features_train, output)

# features_test
with open('Pickles/features_test_noact.pickle', 'wb') as output:
    pickle.dump(features_test, output)
    
# TF-IDF object
with open('Pickles/tfidf_noact.pickle', 'wb') as output:
    pickle.dump(tfidf, output)
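
The repeated blocks above could also be written as a loop over a dictionary of names and objects. The following is only a sketch: the objects are small stand-ins and a temporary folder replaces 'Pickles/', so that it runs on its own.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for the notebook's variables
to_save = {'y_train': [0, 1, 2], 'y_test': [3, 1]}

# Temporary folder standing in for 'Pickles/'
out_dir = tempfile.mkdtemp()

# One file per variable, following the '<name>_noact.pickle' pattern
for name, obj in to_save.items():
    with open(os.path.join(out_dir, name + '_noact.pickle'), 'wb') as output:
        pickle.dump(obj, output)

# Reload one file to check the round trip
with open(os.path.join(out_dir, 'y_train_noact.pickle'), 'rb') as data:
    reloaded = pickle.load(data)

print(reloaded)
```

Keeping the names in one place makes it harder to save an object under the wrong file name.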