## Validation of users by Income + job title regression

In this notebook, I aim to perform the final validation of the user coordinates via regression between income and estimated SES.
The validation includes:
- Simple linear regression and multiple linear regression (ordinary least squares) on coordinates and income for nine different models
- Model comparison between the nine different configurations of markers included in the network
    - Goodness-of-fit characteristics
- The end result is the identification of the marker configuration that gives the best linear regression fit
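
To make the comparison concrete, here is a minimal sketch of the kind of fit I have in mind, run on synthetic data rather than the real coordinates. The helper `ols_r2` and the toy variables are hypothetical. They only illustrate how a simple fit (one coordinate dimension) and a multiple fit (several dimensions) can be compared through R²; the actual analysis may rely on a different library and on more fit statistics.

```python
import numpy as np

def ols_r2(X, y):
    """Fit y = b0 + X @ b by ordinary least squares; return (coefficients, R^2)."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]                              # single predictor -> column vector
    A = np.column_stack([np.ones(len(X)), X])       # prepend an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = float(resid @ resid)                   # residual sum of squares
    ss_tot = float(((y - y.mean()) ** 2).sum())     # total sum of squares
    return beta, 1.0 - ss_res / ss_tot

# Synthetic stand-ins (assumption): two coordinate dimensions and a noisy income
rng = np.random.default_rng(0)
coords = rng.normal(size=(200, 2))
income = 2500 + 400 * coords[:, 0] - 150 * coords[:, 1] + rng.normal(scale=50, size=200)

beta_simple, r2_simple = ols_r2(coords[:, 0], income)   # simple linear regression
beta_multi, r2_multi = ols_r2(coords, income)           # multiple linear regression
print(f"R2 simple: {r2_simple:.3f}, R2 multiple: {r2_multi:.3f}")
```

Because the simple model is nested in the multiple one, its R² can never be higher, so a comparison across marker configurations should hold the number of coordinate dimensions fixed or use an adjusted measure.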

In [64]:
import os
import re
import sys
import numpy as np
import pandas as pd
import matplotlib
import spacy
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import FrenchStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

import importlib
# Local application imports
sys.path.insert(0, '../Utility files')
import utils2
from utils2 import *

## 1 Connect job titles and income information to all coordinate files

First, I need to add the income and job information per user to each of the nine coordinate files, m1-m9.
To do this, I create a function that:
- Iterates over the coordinate files, from n=1 to n=9
- Removes all users that do not occur in the job title file `onlygreens_cleaned.csv`
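
The core of such a function might look roughly like the sketch below. It is not the actual `utils2.filter_add_jobs_coords` implementation, which is not shown in this notebook; the name `filter_coords_to_job_users` and the toy frames are assumptions made for illustration. An inner merge on `follower_id` drops users without a job title and attaches the job columns in one step.

```python
import pandas as pd

def filter_coords_to_job_users(coords: pd.DataFrame, jobs: pd.DataFrame) -> pd.DataFrame:
    """Keep only coordinate rows whose follower_id is in the job table and
    attach the job/income columns (hypothetical stand-in for the utils2 helper)."""
    return coords.merge(jobs, on="follower_id", how="inner")

# Toy data, purely illustrative
coords = pd.DataFrame({"follower_id": [1, 2, 3], "0": [0.1, -0.2, 0.3]})
jobs = pd.DataFrame({"follower_id": [2, 3, 4],
                     "title": ["journaliste", "développeur", "architecte"]})

merged = filter_coords_to_job_users(coords, jobs)
print(merged)
```

Both `follower_id` columns need the same dtype for the merge keys to match, which is why the id column of the job file is cast to an integer type when it is loaded.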

In [23]:
# Load the income + job title data
path = '/home/livtollanes/10.jan-thesis/Code/Validation/Data/'
file = 'onlygreens_cleaned.csv'
jobs = pd.read_csv(path + file, sep=',', index_col=0)

# Convert the id column to int64 so it can be merged with the coordinate files
jobs['follower_id'] = jobs['follower_id'].replace(',', '.', regex=True).astype(float).astype('int64')
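
One caveat worth checking with this cast: a float64 carries 53 bits of precision, so 19-digit IDs cannot all be represented exactly, and routing them through `float` may round them to a neighbouring value. The sketch below mimics the cast chain in plain Python. The helper `via_float` and the example ID are hypothetical, and whether this affects the real data depends on how the IDs are stored in the CSV.

```python
def via_float(raw_id: str) -> int:
    """Mimic the cast chain: decimal comma -> point, then float, then int."""
    return int(float(raw_id.replace(",", ".")))

exact = 1000096085589798913          # hypothetical 19-digit ID, odd on purpose
rounded = via_float(str(exact))      # float64 spacing near 1e18 is 128
small = via_float("1000279182")      # 10-digit IDs survive the round trip

print(exact, rounded, exact == rounded)
print(small == 1000279182)
```

The three 19-digit IDs visible in the `head()` output further down are all multiples of 128, which would be consistent with this kind of rounding having happened somewhere in the pipeline. If the coordinate files hold the exact IDs, rounded IDs would fail to match in a merge.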

In total, we have 11 684 unique users with a job title after preprocessing.

In [21]:
# #Some job category stats
# pd.set_option('display.max_rows', None)
# jobs['title'] = jobs['title'].str.strip()
# print(jobs['PCS_ESE_name'].value_counts().sort_index())

#jobs['PCS_ESE_name'].nunique() #111 key words/titles, and 58 job categories

PCS_ESE_name
Adjoints administratifs des collectivités locales                                                 347
Aides à domicile, aides ménagères, travailleuses familiales                                         4
Aides-soignants                                                                                    34
Allocataires de la recherche publique                                                             273
Animateurs socioculturels et de loisirs                                                             6
Architectes salariés                                                                               85
Artisans salariés de leur entreprise                                                               45
Artistes de la danse                                                                               16
Artistes de la musique et du chant                                                                356
Artistes dramatiques                                                 

In [69]:
#Create the user coordinate files for all models - for users with job titles
for file_number in range(1, 10):  # loop over the coordinate files for models 1 to 9
    filtered_df = utils2.filter_add_jobs_coords(file_number, jobdf=jobs)
    # You may want to do something with filtered_df here, like saving it to a file

Constructed file path: /home/livtollanes/NewData/coordinates/m1_coords/m1_row_coordinates.csv
Constructed file path: /home/livtollanes/NewData/coordinates/m2_coords/m2_row_coordinates.csv
Constructed file path: /home/livtollanes/NewData/coordinates/m3_coords/m3_row_coordinates.csv
Constructed file path: /home/livtollanes/NewData/coordinates/m4_coords/m4_row_coordinates.csv
Constructed file path: /home/livtollanes/NewData/coordinates/m5_coords/m5_row_coordinates.csv
Constructed file path: /home/livtollanes/NewData/coordinates/m6_coords/m6_row_coordinates.csv
Constructed file path: /home/livtollanes/NewData/coordinates/m7_coords/m7_row_coordinates.csv
Constructed file path: /home/livtollanes/NewData/coordinates/m8_coords/m8_row_coordinates.csv
Constructed file path: /home/livtollanes/NewData/coordinates/m9_coords/m9_row_coordinates.csv


In [67]:
importlib.reload(utils2)

<module 'utils2' from '/home/livtollanes/10.jan-thesis/Code/Validation/../Utility files/utils2.py'>

In [96]:
file_number = 1
file_path = f"/home/livtollanes/NewData/job_title_coordinates/m{file_number}_jobs_rowcoords.csv"

df1 = pd.read_csv(file_path, sep = ',')

In [76]:
df2.head()

Unnamed: 0,follower_id,0,1,2,3,screen_name,key_word,description_cleantext,PCS_ESE,PCS_ESE_name,Salaire moyen en EQTP,title
0,1000096085589798912,-0.270807,0.063271,-0.817916,0.029445,GirardMatheo1,journaliste,Journaliste @sports_ouest,352a,Journalistes (y c. rédacteurs en chef),3500.0,journaliste
1,1000279182,-0.084768,-0.646801,0.123394,0.265548,inessbarbier,responsable relations,responsable des relations presse de #lequipe21...,375b,Cadres des relations publiques et de la commu...,4400.0,responsable relations
2,1000447409804242944,0.224374,-0.249452,-0.346344,-0.032713,mandygraillon,adjointe maire,2ème adjointe au Maire d’ #Arles chargée de la...,523c,Adjoints administratifs des collectivités locales,1900.0,adjoint maire
3,1000498107329720320,0.277393,-0.349849,-0.329667,-0.163511,FlorianMathevon,développeur,Développeur Web PHP/JavaScript #Symfony- École...,478a,Techniciens d'étude et de développement en in...,2400.0,développeur
4,100099560,-1.419103,0.262383,1.559013,-2.245815,Minifish57,développeur,Analyste développeur et le plus important je s...,478a,Techniciens d'étude et de développement en in...,2400.0,développeur


In [97]:
# TODO: something is off, because df1 should have the same number of rows as the job file.
# Find out why: double-check the filter function and how it handles follower_ids.
print(df1.shape)
print(df2.shape)
print(df3.shape)
print(df4.shape)
print(df5.shape)
print(df6.shape)
print(df7.shape)
print(df8.shape)
print(df9.shape)

(10713, 12)
(10712, 12)
(10123, 12)
(9884, 12)
(9833, 12)
(9838, 12)
(3305, 12)
(10579, 12)
(3736, 12)
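
A first step in tracking down the missing rows could be to list which job-file users have no match in a given coordinate file. The sketch below uses the `indicator` option of `DataFrame.merge` on toy data. The helper `unmatched_job_users` is hypothetical and not part of `utils2`, and the real frames would replace `jobs_toy` and `coords_toy`.

```python
import pandas as pd

def unmatched_job_users(jobs: pd.DataFrame, coords: pd.DataFrame) -> pd.Series:
    """Return follower_ids that are in the job table but absent from the coordinates."""
    coord_ids = coords[["follower_id"]].drop_duplicates()
    flagged = jobs.merge(coord_ids, on="follower_id", how="left", indicator=True)
    return flagged.loc[flagged["_merge"] == "left_only", "follower_id"]

# Toy data, purely illustrative
jobs_toy = pd.DataFrame({"follower_id": [10, 20, 30, 40]})
coords_toy = pd.DataFrame({"follower_id": [20, 40, 50], "0": [0.1, 0.2, 0.3]})

missing = unmatched_job_users(jobs_toy, coords_toy)
print(len(missing), "job users have no coordinates:", missing.tolist())
```

If the unmatched IDs turn out to be mostly the long 19-digit ones, the id conversion is a likely suspect. If they are spread evenly, it may simply be that not every user with a job title appears in each model's coordinate file, which would also fit the row counts varying so much between models.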
