# Applied Machine Learning

This Jupyter Notebook presents our solution for the Applied ML Homework of the ADA course at EPFL.  

For this homework, we use a [football dataset](CrowdstormingDataJuly1st.csv) 
<sup id="fnref:1"><a href="#fn:1" rel="footnote">1</a></sup> from a sports statistics company. It contains data on all soccer players (N = 2053) who played in the men's first divisions of England, Germany, France and Spain in the 2012-2013 season, and on all referees (N = 3147) that these players played under during their professional careers.

The two main goals of this homework are to:
* Train a random forest classifier that, given a soccer player's description, outputs his skin color
* Use an unsupervised learning technique to cluster the soccer players into 2 disjoint clusters, and check whether there is a relation between the clusters and the players' skin color.

<div class="footnotes">
<hr/>
<ol>
<li id="fn:1">
<p>Since we all live in developed countries, we will use the term "football" instead of "soccer" throughout the homework.<a href="#fnref:1" rev="footnote">&#8617;</a></p></li>
</ol>
</div>


In [1]:
import os
from os import path
from datetime import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # For pretty plots
import gensim
from gensim import corpora, models, similarities
from PIL import Image
from wordcloud import WordCloud, STOPWORDS

%matplotlib inline



In [59]:
d = path.dirname("hillary-clinton-emails/")

# Read only the first 5000 characters of the CSV file.
with open(path.join(d, 'Emails.csv')) as f:
    documents = f.read(5000)



[From strings to vectors](https://radimrehurek.com/gensim/tut1.html)

In [60]:
#stoplist = set('for a of the and to in - 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p q r s t u v w x y z é + " * ç % & / ()'.split())
#print(stoplist)
#documents = ["Human machine interface for lab abc computer applications"]
#type(documents.split())

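# Naive split on every comma: quoted CSV fields that contain commas
# (names, dates) are cut into several pieces.
documents = documents.split(',')
documents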


['Id',
 'DocNumber',
 'MetadataSubject',
 'MetadataTo',
 'MetadataFrom',
 'SenderPersonId',
 'MetadataDateSent',
 'MetadataDateReleased',
 'MetadataPdfLink',
 'MetadataCaseNumber',
 'MetadataDocumentClass',
 'ExtractedSubject',
 'ExtractedTo',
 'ExtractedFrom',
 'ExtractedCc',
 'ExtractedDateSent',
 'ExtractedCaseNumber',
 'ExtractedDocNumber',
 'ExtractedDateReleased',
 'ExtractedReleaseInPartOrFull',
 'ExtractedBodyText',
 'RawText\n1',
 'C05739545',
 'WOW',
 'H',
 '"Sullivan',
 ' Jacob J"',
 '87',
 '2012-09-12T04:00:00+00:00',
 '2015-05-22T04:00:00+00:00',
 'DOCUMENTS/HRC_Email_1_296/HRCH2/DOC_0C05739545/C05739545.pdf',
 'F-2015-04841',
 'HRC_Email_296',
 'FW: Wow',
 '',
 '"Sullivan',
 ' Jacob J <Sullivan11@state.gov>"',
 '',
 '"Wednesday',
 ' September 12',
 ' 2012 10:16 AM"',
 'F-2015-04841',
 'C05739545',
 '05/13/2015',
 'RELEASE IN FULL',
 '',
 '"UNCLASSIFIED\nU.S. Department of State\nCase No. F-2015-04841\nDoc No. C05739545\nDate: 05/13/2015\nSTATE DEPT. - PRODUCED TO HOUSE SE
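
The output above shows that splitting on every comma cuts quoted fields apart: `"Sullivan, Jacob J"` ends up as two entries. One possible alternative is Python's standard `csv` module, sketched here on a small made-up sample. The `sample` string and its three columns are illustrative assumptions, not the real `Emails.csv`.

```python
import csv
import io

# Hypothetical two-line sample imitating the layout of Emails.csv;
# the real file has many more columns.
sample = 'Id,MetadataSubject,MetadataFrom\n1,WOW,"Sullivan, Jacob J"\n'

# Naive approach: the quoted name is cut in two at its inner comma.
naive_fields = sample.splitlines()[1].split(',')

# csv.reader honours the quotes and keeps the name as a single field.
rows = list(csv.reader(io.StringIO(sample)))
header, first_row = rows[0], rows[1]

print(naive_fields)
print(first_row)
```

With the real file, the same reader could be wrapped around the opened file object instead of `io.StringIO`.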

In [61]:
stoplist = set(STOPWORDS)

# Also ignore the bare numbers 1 through 10.
stoplist.update(str(n) for n in range(1, 11))

print(stoplist)


{'', 'her', "don't", 'an', 'its', 'they', 'as', 'this', 'were', 'am', 'into', 'after', 'could', 'you', 'those', 'to', '9', "here's", 'http', 'me', 'we', "i've", 'myself', "we've", 'yours', 'very', 'both', 'that', 'was', "mustn't", 'nor', 'r', '6', "she's", 'below', 'theirs', 'of', "i'd", 'further', 'because', 'some', "how's", '1', 'from', 'against', 'their', 'get', 'and', 'most', 'cannot', 'being', 'a', 'in', 'through', 'while', "you'd", 'same', 'ours', 'any', 'be', 'yourselves', 'herself', 'on', 'just', "wouldn't", 'before', 'there', "there's", 'then', 'more', "they'll", "hasn't", "they'd", 'down', "who's", 'i', "why's", '4', "when's", '3', "you've", 'above', 'under', "she'll", "i'll", 'like', 'does', 'would', "couldn't", 'for', "can't", 'been', '5', "it's", 'by', 'off', "weren't", 'your', "aren't", '2', "hadn't", 'too', 'which', 'out', 'over', "didn't", 'whom', "what's", 'between', 'few', 'why', 'ought', 'having', 'own', 'my', "you're", 'did', "i'm", "we'll", "he'll", 'doing', "let's

In [62]:
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

texts

[['id'],
 ['docnumber'],
 ['metadatasubject'],
 ['metadatato'],
 ['metadatafrom'],
 ['senderpersonid'],
 ['metadatadatesent'],
 ['metadatadatereleased'],
 ['metadatapdflink'],
 ['metadatacasenumber'],
 ['metadatadocumentclass'],
 ['extractedsubject'],
 ['extractedto'],
 ['extractedfrom'],
 ['extractedcc'],
 ['extracteddatesent'],
 ['extractedcasenumber'],
 ['extracteddocnumber'],
 ['extracteddatereleased'],
 ['extractedreleaseinpartorfull'],
 ['extractedbodytext'],
 ['rawtext'],
 ['c05739545'],
 ['wow'],
 ['h'],
 ['"sullivan'],
 ['jacob', 'j"'],
 ['87'],
 ['2012-09-12t04:00:00+00:00'],
 ['2015-05-22t04:00:00+00:00'],
 ['documents/hrc_email_1_296/hrch2/doc_0c05739545/c05739545.pdf'],
 ['f-2015-04841'],
 ['hrc_email_296'],
 ['fw:', 'wow'],
 [],
 ['"sullivan'],
 ['jacob', 'j', '<sullivan11@state.gov>"'],
 [],
 ['"wednesday'],
 ['september', '12'],
 ['2012', '10:16', 'am"'],
 ['f-2015-04841'],
 ['c05739545'],
 ['05/13/2015'],
 ['release', 'full'],
 [],
 ['"unclassified',
  'u.s.',
  'depar
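
Splitting on whitespace leaves punctuation attached to tokens (`'"sullivan'`, `'fw:'`, `'am"'`), so a stop word such as `am` slips through when it carries a trailing quote. A minimal sketch of a stricter tokenizer follows, assuming it is acceptable to keep only runs of letters and digits. `tokenize` and `toy_stoplist` are hypothetical names used for illustration.

```python
import re

# Hypothetical mini stoplist for illustration; the notebook uses wordcloud's STOPWORDS.
toy_stoplist = {'am', 'j'}

def tokenize(document, stoplist):
    """Lowercase the text, keep runs of letters/digits, drop stop words."""
    words = re.findall(r"[a-z0-9]+", document.lower())
    return [w for w in words if w not in stoplist]

tokens_time = tokenize(' 2012 10:16 AM"', toy_stoplist)
tokens_name = tokenize(' Jacob J <Sullivan11@state.gov>"', toy_stoplist)

print(tokens_time)
print(tokens_name)
```

The trade-off is that times and e-mail addresses are also broken into pieces, which may or may not be wanted for this corpus.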

In [63]:
# Remove words that appear only once in the whole corpus
from collections import defaultdict

frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1]
         for text in texts]
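
To make the effect of this filter concrete, here is a small sketch on toy data. It uses `collections.Counter` in place of the `defaultdict` loop, which is a swapped-in but equivalent way of counting. The `toy_texts` list is made up for illustration.

```python
from collections import Counter

# Made-up token lists standing in for the notebook's `texts`.
toy_texts = [['libya', 'report'], ['libya', 'memo'], ['report', 'sid']]

# Count how often each token occurs across all texts.
frequency = Counter(token for text in toy_texts for token in text)

# Keep only tokens seen more than once in the whole corpus.
filtered = [[token for token in text if frequency[token] > 1]
            for text in toy_texts]

print(filtered)
```

Tokens that occur a single time (`memo` and `sid` here) disappear, and a text can become shorter or even empty as a result.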



To represent the texts using only ids (integers), we need a dictionary that stores the mapping from each token to its id.

In [64]:
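dictionary = corpora.Dictionary(texts)
# Make sure the output directory exists before saving the dictionary.
os.makedirs("part3", exist_ok=True)
dictionary.save("part3/deerwester.dict")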


In [65]:
print(dictionary)

Dictionary(91 unique tokens: ['house', 'u.s.', '10:16', 'date:', 'memo']...)


In [66]:
print(dictionary.token2id)

{'house': 17, 'u.s.': 18, '10:16': 12, 'date:': 19, 'memo': 61, '"unclassified': 34, 'loyal': 84, 'libyan': 76, 'thursday': 55, '9:45': 64, 'benghazi': 31, 'hrc_email_296': 7, 'leader': 78, 'no.': 32, 'army': 88, 'convinced': 83, 'forces.': 87, 'hrc': 60, 'agreement': 37, 'pm': 62, 'sullivan': 20, '&': 39, 'release': 15, 'case': 40, 'rebel': 81, 'report': 68, 'syria': 50, '2011': 58, 'civil': 73, '-': 33, 'sent:': 43, 'information': 26, 'for:': 66, 'doc': 28, 'foia': 29, 'former': 74, 'air': 70, 'more...': 48, 'waiver.': 30, 'h': 2, '12': 10, '030311.docx;': 57, 'redactions.': 38, 'added': 90, 'part': 54, 'training': 86, 'syrian': 89, 'produced': 41, '05/13/2015': 14, 'sid': 49, 'states': 79, 'qaddafi': 52, 'march': 56, 'fw:': 8, '"sullivan': 3, 'support': 71, 'troops': 82, 'subject:': 44, 'aiding': 53, 'libya': 59, '2012': 13, 'full': 16, 'comm.': 35, 'j': 9, 'war': 72, '2015-05-22t04:00:00+00:00': 5, 'sensitive': 22, 'jacob': 4, 'b6': 67, '030311.docx': 63, 'unclassified': 46, 'f-201
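
The mapping printed above assigns one integer id to each unique token. As a rough pure-Python sketch of the idea (not gensim's implementation), each text can then be turned into sparse `(id, count)` pairs, which is the kind of output gensim's `Dictionary.doc2bow` produces. `build_token2id` and `to_bow` are hypothetical helper names, and ids are assumed to be handed out in first-seen order, which need not match gensim's numbering.

```python
from collections import Counter

def build_token2id(texts):
    """Assign ids in first-seen order (hypothetical helper)."""
    token2id = {}
    for text in texts:
        for token in text:
            # len(token2id) is evaluated before insertion, so ids run 0, 1, 2, ...
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(text, token2id):
    """Return sorted (token_id, count) pairs for tokens known to the mapping."""
    counts = Counter(token2id[t] for t in text if t in token2id)
    return sorted(counts.items())

# Made-up token lists standing in for the notebook's `texts`.
toy_texts = [['libya', 'report'], ['libya', 'libya', 'memo']]
token2id = build_token2id(toy_texts)
bow = [to_bow(text, token2id) for text in toy_texts]

print(token2id)
print(bow)
```

Storing only the ids that actually occur keeps each vector short even when the vocabulary is large.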

--------------------

In [67]:
d = path.dirname("hillary-clinton-emails/")

# Read the whole text.
with open(path.join(d, 'Emails.csv')) as f:
    text = f.read()


In [68]:
#id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
#id2word = gensim.corpora.Dictionary(text)



[Transformation interface](https://radimrehurek.com/gensim/tut2.html#available-transformations)