## Importing the data
I first import the training and testing data, then aggregate the per-company testing files into a single testing dataset.

In [1]:
# import the libraries I need
import pickle
import pandas as pd
import numpy as np
import warnings
import os
import json
warnings.filterwarnings("ignore")

In [2]:
#import the training data 
training = pd.read_pickle('labelled_dataset.pickle')


In [3]:
# get the testing data
# define the data directory
data_dir = os.path.join(os.getcwd(), "unlabelled-dataset")
file_list = os.listdir(data_dir)

# loop over the companies' files and aggregate their reviews
rows = []
for file_name in file_list:
    with open(os.path.join(data_dir, file_name), 'r') as f:
        reviews = json.load(f)

    # the company name is the file name without its ".json" extension
    company = file_name[:-5]
    for review in reviews:
        rows.append([review['text'], company])

# define the testing dataset
testing = pd.DataFrame(rows, columns=['text', 'company'])




I now convert the words into a dictionary: I vectorise the sentences using CountVectorizer, which takes the words of each sentence and creates a vocabulary of all the unique words in the sentences.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
sentences_train = training["text"].values
vectorizer.fit(sentences_train)
X_train = vectorizer.transform(sentences_train)
X_test  = vectorizer.transform(testing["text"].values)
#X_train
y = training['labelmax'].values
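
To make the bag-of-words step concrete, here is a minimal pure-Python sketch of roughly what CountVectorizer does with its default settings (lowercasing, tokens of two or more word characters, alphabetically ordered vocabulary). The sentences and the helper names `tokenize` and `to_counts` are invented for illustration; this is a simplified stand-in, not the library's implementation.

```python
import re
from collections import Counter

# Toy sentences, invented for illustration (not taken from the dataset).
sentences = ["Great team and great culture", "The team moves fast"]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# The vocabulary maps each unique word to a column index (alphabetical order).
unique_words = sorted({tok for s in sentences for tok in tokenize(s)})
vocabulary = {word: idx for idx, word in enumerate(unique_words)}

def to_counts(text):
    # One row of the document-term matrix: a count per vocabulary word.
    # Words that are not in the vocabulary are silently ignored.
    counts = Counter(tokenize(text))
    return [counts.get(word, 0) for word in vocabulary]

matrix = [to_counts(s) for s in sentences]
print(vocabulary)
print(matrix)
```

Because unknown words are ignored, any word that appears only in the unlabelled reviews contributes nothing to the prediction, which is one reason the vocabulary is fitted on the training sentences and only applied to the testing ones.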

## Training
I first try a naïve Bayes classifier and then logistic regression, the most natural choice:

### Naïve Bayes
I fit the model and look at the accuracy on the training data.

In [5]:
from sklearn import naive_bayes

# fit the training dataset on the NB classifier
Naive = naive_bayes.MultinomialNB()

Naive.fit(X_train,y)
scoreNB_over = Naive.score(X_train, y)


# report the accuracy on the training data
print("Accuracy during training (NB)", scoreNB_over)


Accuracy during training (NB) 0.6349967241755842


Naïve Bayes does not reach the 90% accuracy target.

### Logistic regression
I now try logistic regression.

In [6]:
# train 
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(multi_class = 'auto', n_jobs = -1);
classifier.fit(X_train, y);


I check that I've reached the 90% required accuracy:

In [7]:
scoreLG = classifier.score(X_train, y)
print ('Accuracy of the training with LG', scoreLG)

Accuracy of the training with LG 0.9586263376283031


#### Accuracy is above 90%!

##### Remark:  
I also split the training data into two subgroups, which I again call training and test. This lets me assess the quality of the model on a subset not seen by the classifier during training. This is not crucial, which is why I've commented it out. Accuracy on the randomly chosen test data was about 83%.


In [8]:
# ########
# # I split the training data in two subgroups which I again call training and test. 
# # I use this test to see the quality of the modeling. 
# ########
# from sklearn.model_selection import train_test_split
# sentences = training['text'].values
# y = training['labelmax'].values

# sentences_train, sentences_test, y_train, y_test = train_test_split(sentences, y, test_size=0.1, random_state=1000)

# vectorizer = CountVectorizer();
# vectorizer.fit(sentences_train);
# X_train = vectorizer.transform(sentences_train);
# X_test  = vectorizer.transform(sentences_test);
# X =  vectorizer.transform(sentences)

# from sklearn.linear_model import LogisticRegression
# #logistic regression
# classifier = LogisticRegression(multi_class = 'auto', n_jobs = -1);

# # train with the whole dataset
# classifier.fit(X, y);
# score_overLG = classifier.score(X, y)

# # train with a subsample of the training data to test overfit
# classifier.fit(X_train, y_train);
# scoreLG = classifier.score(X_test, y_test)
# print("Accuracy during training", score_overLG)
# print("Accuracy on a sub-sample of training", scoreLG)


## Testing
I now use the model learnt during training to make the predictions with the unlabelled dataset:

In [9]:
# predict
labels = pd.DataFrame(classifier.predict(X_test), columns = ['labels'])

I then define the Results dataset and export it.

In [10]:
Results = pd.concat([testing,labels],axis = 1)
Results.head()

| | text | company | labels |
|---|---|---|---|
| 0 | Pros - super freundliche Kollegen - Weiterbild... | 123makler | |
| 1 | Pros - Fast pace - Lots of opportunities to gr... | Aaptiv | collaboration |
| 2 | Pros Great leadership--team has the right focu... | Aaptiv | collaboration |
| 3 | Pros The atmosphere and people create an amazi... | Aaptiv | adaptability |
| 4 | Pros I really can't say enough positive things... | Aaptiv | collaboration |


I export the data to a CSV file.

In [11]:
Results.to_csv('resultsPredictions.csv',index=False)


### How can one improve the analysis
The crucial parts for improving the analysis are:
1. pre-processing of the data, i.e. feature engineering
2. the choice of the classifier

#### 1. Pre-processing
By looking at the data we see that we have at least two languages: German and English. This should be taken into account, as suggested. Besides the language differences, one could consider, e.g.:
1. n-grams
2. TF-IDF to highlight more interesting words at review level
3. removing stop words
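
As a rough sketch of how the three ideas in this list (n-grams, TF-IDF, stop-word removal) could be combined, scikit-learn's `TfidfVectorizer` can be used in place of `CountVectorizer`. The reviews below are invented for illustration, and the parameter values are assumptions rather than tuned choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy reviews, invented for illustration (not taken from the dataset).
reviews = [
    "The team is very friendly and the culture is open",
    "Fast pace and a lot of opportunities to grow",
    "The culture is open and the leadership is great",
]

# Unigrams and bigrams, English stop words removed, TF-IDF weighting.
tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
X = tfidf.fit_transform(reviews)

print(X.shape)
print(sorted(tfidf.vocabulary_)[:10])
```

Note that `stop_words="english"` covers English only; the German reviews would need their own stop-word list, or the two languages could be vectorised separately.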

#### 2. Classifier
1. One could use different algorithms to perform the task. Among the most popular are Naïve Bayes, Bayesian mixture models, SVMs, and tree-based approaches (gradient boosting, random forests). Given enough time, the best strategy would be to fit all the models on a subsample of the training data and select the one with the best performance. For this final step, we could use n-fold cross-validation for each method and test the difference between the accuracies.
2. For each algorithm, we could also run a grid-search optimisation of the hyperparameters.
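
The model comparison and the grid search described in these two points could be sketched together with scikit-learn's `GridSearchCV` on a pipeline. The texts, labels, candidate models and parameter values below are illustrative assumptions; on the real data the inputs would be `training["text"]` and `training["labelmax"]`, with more folds.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy labelled data, invented for illustration.
texts = [
    "great team spirit", "we work together well",
    "helpful colleagues everywhere", "team supports each other",
    "things change very fast", "we adapt quickly to change",
    "flexible and fast moving", "priorities shift every week",
]
labels = ["collaboration"] * 4 + ["adaptability"] * 4

# Vectorising inside the pipeline refits the vocabulary on each training fold.
pipeline = Pipeline([("vect", CountVectorizer()), ("clf", LogisticRegression())])

# Each dict is one family of candidates: a classifier and its hyperparameters.
param_grid = [
    {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1.0, 10.0]},
    {"clf": [MultinomialNB()], "clf__alpha": [0.1, 1.0]},
]

# cv=2 only because the toy dataset is tiny.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

The reported score is a cross-validated accuracy, so it is comparable to the roughly 83% held-out figure mentioned earlier rather than to the training accuracy.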
    
#### Using this analysis for business purposes: a remark
If I were to use this analysis for business purposes, I would need to present the results, e.g., to a customer. I would suggest at least two main visualisations. The first would aggregate the results at company level using a voting scheme: the dimension receiving the most votes would characterise the current company culture. The second, more interesting approach would be to characterise the company using all six dimensions, visualised with a radar plot. This second approach would give visual feedback on the current cultural state and would highlight the dimensions on which the company should work. As Bunch, we could then propose actions (workshops, seminars, ...) to help the company improve specific dimensions, and use the radar plots before and after the action to quantify its impact.
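
A minimal sketch of the two aggregations just described, the majority vote and the full per-dimension profile, is given here in plain Python. The company names, the rows and the helper `company_profiles` are hypothetical; in the notebook the rows would come from the `company` and `labels` columns of `Results`.

```python
from collections import Counter

# Hypothetical (company, predicted label) pairs standing in for the rows of Results.
predictions = [
    ("AcmeCo", "collaboration"), ("AcmeCo", "collaboration"),
    ("AcmeCo", "adaptability"), ("BetaInc", "adaptability"),
    ("BetaInc", "adaptability"), ("BetaInc", "collaboration"),
    ("BetaInc", "adaptability"),
]

def company_profiles(rows):
    # Count the votes each dimension receives per company.
    votes = {}
    for company, label in rows:
        votes.setdefault(company, Counter())[label] += 1
    profiles = {}
    for company, counter in votes.items():
        total = sum(counter.values())
        profiles[company] = {
            # Majority vote: the single dominant dimension.
            "dominant": counter.most_common(1)[0][0],
            # Full profile: percentage of reviews per dimension (radar-plot input).
            "shares": {label: 100 * n / total for label, n in counter.items()},
        }
    return profiles

profiles = company_profiles(predictions)
print(profiles["AcmeCo"]["dominant"])
print(profiles["BetaInc"]["shares"])
```

In this sketch a tie is resolved in favour of the label seen first, which is arbitrary; a real report would probably flag ties explicitly.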


In [12]:
# Count the number of reviews
ResultsGrouped = Results.groupby(["company"]).count()
ResultsGrouped

# AOL has 1255 reviews, so I select only this company
ResultsAOL = Results[Results["company"] == "AOL"]
ResultsAOL = ResultsAOL[ResultsAOL["labels"]!='null']
# count the number of reviews per dimension and normalise as percentages
ResultsAOL_dimension = 100*ResultsAOL.groupby(["labels"]).count()/1255
ResultsAOL_dimension

#radar visualisation

from math import pi
import matplotlib.pyplot as plt


# take the categories from the grouped index so that labels and values share the same order
Categories = ResultsAOL_dimension.index.tolist()
N = len(Categories)
Values = ResultsAOL_dimension["text"].tolist()
Values.append(Values[0]) # repeat the first value to close the polygon
# angle of each axis in the plot (full circle divided by the number of variables)
Angles = [n / float(N) * 2 * pi for n in range(N)]
Angles += Angles[:1]

# Initialise the spider plot
ax = plt.subplot(111, polar=True)
 
# Draw one axis per variable and add the category labels
plt.xticks(Angles[:-1], Categories, color='grey', size=8)
 
# Draw ylabels
ax.set_rlabel_position(0)
plt.yticks([10,20,30], ["10","20","30"], color="grey", size=7)
plt.ylim(0,40)
 
# Plot data
ax.plot(Angles, Values, linewidth=1, linestyle='solid')
 
# Fill area
ax.fill(Angles, Values, 'b', alpha=0.1);

### How to go into production
This clearly depends on the business case. Important factors in the decision are the local infrastructure where the classifier should be available, how the classifier needs to be used, and the customer's needs and constraints, including legal limitations.
We could have two possible approaches:
1. put everything into a container to be deployed, for instance, on a customer's server.
2. expose an endpoint to which one can send a prediction request and get results back. For this, one could use Azure or AWS. The basic workflow would be:
<img src="workflow.png" style="width: 700px;" align="left"/>
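
For the endpoint approach, the request-handling core could look roughly like the sketch below, independent of the web framework or cloud provider. The payload shape (`{"texts": [...]}`), the function names and the stub predictor are all assumptions made for illustration; in practice `predict_fn` would wrap the fitted vectorizer and classifier.

```python
import json

def predict_stub(texts):
    # Placeholder for classifier.predict(vectorizer.transform(texts)).
    return ["collaboration" if "team" in t.lower() else "adaptability" for t in texts]

def handle_request(body, predict_fn=predict_stub):
    # Parse the JSON payload, validate it, and return (status code, JSON response).
    try:
        payload = json.loads(body)
        texts = payload["texts"]
        if not isinstance(texts, list) or not all(isinstance(t, str) for t in texts):
            raise ValueError("'texts' must be a list of strings")
    except (ValueError, KeyError, TypeError) as err:
        return 400, json.dumps({"error": str(err)})
    return 200, json.dumps({"labels": list(predict_fn(texts))})

status, response = handle_request('{"texts": ["Great team", "Fast changes"]}')
print(status, response)
```

Keeping validation and prediction in one plain function like this would let the same code sit behind either option, a container on the customer's server or a hosted endpoint.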
    