This Jupyter notebook contains the Python code we used to build a linear SVM text classifier with `sklearn`.
To run it correctly, make sure that the 'cat/' folder (the corpus of training documents) and the 'tweets/' folder (the tweets from the different cities) are in the same directory as this notebook.
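
Before running the cells below, a quick check like the following can confirm that both folders are in place. It is a minimal sketch: the helper name `missing_data_dirs` is ours, and it only checks that the folders exist, not what they contain.

```python
import os

def missing_data_dirs(base_dir='.', required=('cat', 'tweets')):
    """Return the required folder names that are not present under base_dir."""
    return [name for name in required
            if not os.path.isdir(os.path.join(base_dir, name))]

missing = missing_data_dirs()
if missing:
    print('Missing folders:', ', '.join(missing))
else:
    print('Found cat/ and tweets/')
```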

In [3]:
import numpy as np  
import re  
import nltk  
from sklearn.datasets import load_files  
nltk.download('stopwords')  
import pickle  
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

[nltk_data] Downloading package stopwords to /home/mk/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [75]:
# Load the data from the category folder
film_data = load_files('./cat')
X, y = film_data.data, film_data.target  

In [76]:
# See the categories of some of the documents in the target set
for t in film_data.target[:10]:
    print(film_data.target_names[t])

music
music
sports
music
film
film
music
sports
sports
music


In [77]:
# pre-process text
stemmer = PorterStemmer()
documents = []

for sen in range(0, len(X)):  
    # Remove all the special characters
    document = re.sub(r'\W', ' ', str(X[sen]))

    # remove all single characters
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)

    # Remove a single character at the start of the document
    # (the anchor is an unescaped ^; an escaped \^ would match a literal caret)
    document = re.sub(r'^[a-zA-Z]\s+', ' ', document)

    # Substituting multiple spaces with single space
    document = re.sub(r'\s+', ' ', document, flags=re.I)

    # Removing prefixed 'b'
    document = re.sub(r'^b\s+', '', document)

    # Converting to Lowercase
    document = document.lower()

    # Stemming: split into words, then reduce each word to its Porter stem
    document = document.split()
    document = [stemmer.stem(word) for word in document]
    document = ' '.join(document)

    documents.append(document)
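
To see what the cleaning steps do to one document, the regex part of the loop can be wrapped in a function and run on a sample string. This is a sketch rather than the exact cell: the name `clean_text` and the sample input are hypothetical, stemming is left out so that it runs without `nltk`, and the start-of-string pattern is written with an unescaped `^` anchor, which also takes care of the leading 'b'.

```python
import re

def clean_text(raw):
    """Apply the regex cleaning steps from the loop above to one string."""
    text = re.sub(r'\W', ' ', raw)                # non-word characters -> spaces
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)   # single letters between spaces
    text = re.sub(r'^[a-zA-Z]\s+', ' ', text)     # single letter at the start
    text = re.sub(r'\s+', ' ', text)              # collapse runs of whitespace
    return text.lower().strip()

# str() of a bytes object keeps the b'...' wrapper, as with str(X[sen])
sample = str(b'Great goal, what a match!')
print(clean_text(sample))   # great goal what match
```

The leading `b` in the sample comes from the bytes wrapper, which is why the original loop has a dedicated step for it.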

In [78]:
# create vectorizer to make feature vectors
from sklearn.feature_extraction.text import CountVectorizer  
vectorizer = CountVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))  
X = vectorizer.fit_transform(documents).toarray()

In [79]:
# tf-idf
from sklearn.feature_extraction.text import TfidfTransformer  
tfidfconverter = TfidfTransformer()  
X = tfidfconverter.fit_transform(X).toarray() 
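
As a worked example of what the tf-idf step computes: with its default settings, `TfidfTransformer` weights each raw count by a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then scales every row to unit Euclidean length. The sketch below reproduces that arithmetic in plain Python on a made-up count matrix; the counts and the function name `tfidf_rows` are illustrative only.

```python
import math

def tfidf_rows(counts):
    """Smoothed-idf, L2-normalised tf-idf for a list of count rows."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # document frequency: number of rows in which each term appears
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted = [[c * w for c, w in zip(row, idf)] for row in counts]
    result = []
    for row in weighted:
        norm = math.sqrt(sum(v * v for v in row))
        result.append([v / norm if norm else 0.0 for v in row])
    return result

# three toy documents over a three-term vocabulary (made-up counts)
counts = [[2, 0, 1],
          [0, 1, 1],
          [1, 0, 1]]
for row in tfidf_rows(counts):
    print([round(v, 3) for v in row])
```

The third term appears in every document, so its idf is exactly 1, while the second term appears in only one document and is weighted more heavily there.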

In [80]:
# divide into testing and training sets
from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0) 

Run one of the two cells below to build a classifier, then compare the accuracy of the naive Bayes model and the linear SVM.

In [22]:
# make a classifier with Naive Bayes
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(X_train, y_train)

In [91]:
# classifier with linear SVM
from sklearn.linear_model import SGDClassifier
classifier = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None).fit(X_train, y_train)
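
The vectorizer, tf-idf transform, and classifier can also be chained into a single `Pipeline`, so that raw text goes in and predictions come out. The sketch below does this for the linear SVM on a handful of made-up sentences; the toy documents and labels are hypothetical and stand in for the 'cat/' corpus, and the `CountVectorizer` is left at its defaults because the `min_df=5` setting used above would discard every term in such a tiny sample.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier

# toy stand-in for the training corpus (hypothetical data)
docs = ['the band played a new song', 'the album has great guitar',
        'the team won the match', 'the striker scored a goal',
        'the director shot the movie', 'the actor starred in the film']
labels = ['music', 'music', 'sports', 'sports', 'film', 'film']

text_clf = Pipeline([
    ('counts', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                          random_state=42, max_iter=5, tol=None)),
])
text_clf.fit(docs, labels)

# the pipeline applies the same counting and tf-idf steps before predicting
predicted = text_clf.predict(['a goal in the last match', 'a song on the album'])
print(predicted)
```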

Once you have built a classifier, test its accuracy with the cell below. Try both naive Bayes and the linear SVM to see the difference.

In [92]:
# test predictions
y_pred = classifier.predict(X_test)

# Evaluating the model
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test,y_pred))  
print(classification_report(y_test,y_pred))  
print(accuracy_score(y_test, y_pred))  

[[7 1 0]
 [0 2 0]
 [0 0 2]]
             precision    recall  f1-score   support

          0       1.00      0.88      0.93         8
          1       0.67      1.00      0.80         2
          2       1.00      1.00      1.00         2

avg / total       0.94      0.92      0.92        12

0.9166666666666666
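
The figures in the report can be checked by hand against the confusion matrix, whose rows are true classes and whose columns are predicted classes. The sketch below recomputes accuracy, precision, and recall from the matrix printed above in plain Python; the function name `per_class_metrics` is ours.

```python
def per_class_metrics(cm):
    """Return (accuracy, precisions, recalls) for a square confusion matrix
    with true classes on the rows and predicted classes on the columns."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    precisions, recalls = [], []
    for i in range(n):
        predicted_i = sum(cm[r][i] for r in range(n))   # column sum
        actual_i = sum(cm[i])                           # row sum (support)
        precisions.append(cm[i][i] / predicted_i if predicted_i else 0.0)
        recalls.append(cm[i][i] / actual_i if actual_i else 0.0)
    return accuracy, precisions, recalls

cm = [[7, 1, 0],
      [0, 2, 0],
      [0, 0, 2]]
accuracy, precisions, recalls = per_class_metrics(cm)
print(round(accuracy, 4))                  # 0.9167
print([round(p, 2) for p in precisions])   # [1.0, 0.67, 1.0]
print(recalls)                             # [0.875, 1.0, 1.0]
```

The one misclassified document is a class-0 item predicted as class 1, which is why class 0 loses recall (7/8) and class 1 loses precision (2/3).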


Now run predictions on our tweets.
Change the value of `city` in the cells below to 'SanDiego', 'NewYork', or 'Denver' to run the classifier on a different city or to see that city's results.

In [96]:
# predicting on tweets
import json
import os

city = 'NewYork'


for root, dirs, files, in os.walk('tweets/{}'.format(city)):
    for dirname in dirs:
        tweets = {}
        docs_new = []

        # classifying each tweet from user
        tweets_fname = 'tweets/{}/{}/tweets.json'.format(city, dirname)
        onedoc_fname = 'tweets/{}/{}/text.txt'.format(city, dirname)
        t_results_fname = 'tweets/{}/{}/t_results.json'.format(city, dirname)
        o_results_fname = 'tweets/{}/{}/o_results.json'.format(city, dirname)
        print(tweets_fname)
        with open(tweets_fname, 'r') as f:
            tweets = json.loads(f.read())
        for tweet in tweets:
            docs_new.append(tweet['text'])

        X_new_counts = vectorizer.transform(docs_new)
        X_new_tfidf = tfidfconverter.transform(X_new_counts)

        predicted = classifier.predict(X_new_tfidf)
        

        results = []
        for i in range(0, len(tweets)):
            results.append({'doc': tweets[i]['text'], 'id': tweets[i]['id'], 
                            'pred_num': int(predicted[i]), 'pred_cat': film_data.target_names[predicted[i]]})
            
        # Write results
        with open(t_results_fname, 'w') as f:
            f.write(json.dumps(results))

        # Classifying aggregate of user's tweets
        docs_new.clear()
        with open(onedoc_fname, 'r') as f:
            docs_new.append(f.read())

        X_new_counts = vectorizer.transform(docs_new)
        X_new_tfidf = tfidfconverter.transform(X_new_counts)

        predicted = classifier.predict(X_new_tfidf)

        # write results
        with open(o_results_fname, 'w') as f:
            result = {'pred_num': int(predicted[0]), 'pred_cat': film_data.target_names[predicted[0]]}
            f.write(json.dumps(result))


tweets/NewYork/21990748/tweets.json
tweets/NewYork/21734241/tweets.json
tweets/NewYork/218617107/tweets.json
tweets/NewYork/156720500/tweets.json
tweets/NewYork/64467795/tweets.json
tweets/NewYork/245272915/tweets.json
tweets/NewYork/556256860/tweets.json
tweets/NewYork/255812611/tweets.json
tweets/NewYork/43040265/tweets.json
tweets/NewYork/304627263/tweets.json
tweets/NewYork/20551303/tweets.json
tweets/NewYork/30155972/tweets.json
tweets/NewYork/284274787/tweets.json
tweets/NewYork/3014741/tweets.json
tweets/NewYork/21030857/tweets.json
tweets/NewYork/466744226/tweets.json
tweets/NewYork/315142011/tweets.json
tweets/NewYork/90926420/tweets.json
tweets/NewYork/426962930/tweets.json
tweets/NewYork/1977157224/tweets.json
tweets/NewYork/159762740/tweets.json
tweets/NewYork/335946517/tweets.json
tweets/NewYork/254108775/tweets.json
tweets/NewYork/65431946/tweets.json
tweets/NewYork/48849410/tweets.json
tweets/NewYork/22320759/tweets.json
tweets/NewYork/874650841/tweets.json
tweets/NewYor
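
`os.walk` descends into every subdirectory, so the loop above relies on the user folders having no folders of their own. If only the immediate user folders under a city are wanted, they can be listed directly. A minimal sketch, assuming the `tweets/<city>/<user id>/` layout used above; the helper name `user_dirs` is hypothetical.

```python
import os

def user_dirs(city, tweets_root='tweets'):
    """Return the names of the immediate subdirectories of tweets_root/city."""
    city_path = os.path.join(tweets_root, city)
    with os.scandir(city_path) as entries:
        return sorted(entry.name for entry in entries if entry.is_dir())

# only runs the listing when the folder is actually present
if os.path.isdir(os.path.join('tweets', 'NewYork')):
    for dirname in user_dirs('NewYork'):
        print(os.path.join('tweets', 'NewYork', dirname, 'tweets.json'))
```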

In [103]:
# looking at users results
from collections import Counter

cat_counts = Counter()
city = 'NewYork'
results = []
total_users = 0
for root, dirs, files, in os.walk('tweets/{}'.format(city)):
    for dirname in dirs:
        total_users += 1
        o_results_fname = 'tweets/{}/{}/o_results.json'.format(city, dirname)
        with open(o_results_fname, 'r') as f:
            result = json.loads(f.read())
            results.append(result['pred_cat'])
            
cat_counts = Counter(results)
for k,val in cat_counts.items():
    print('{}:{} {}'.format(k, val, val/total_users))
    
        

sports:123 0.7365269461077845
film:44 0.2634730538922156


In [104]:
# looking at results of tweets
city = 'NewYork'
cat_counts = Counter()
tweets = 0
for root, dirs, files, in os.walk('tweets/{}'.format(city)):
    for dirname in dirs:
        results = {}
        cats = []
        t_results_fname = 'tweets/{}/{}/t_results.json'.format(city, dirname)
        with open(t_results_fname, 'r') as f:
            results = json.loads(f.read())
        tweets += len(results)
        for tweet in results:
            cats.append(tweet['pred_cat'])
        cat_counts.update(cats)
        
for k,val in cat_counts.items():
    print('{}:{} {}'.format(k, val, val/tweets))


film:9128 0.2795626473921166
music:4841 0.14826498422712933
sports:18682 0.572172368380754


Here are the outputs of the two cells above for each city, using the linear SVM. Each line has the form `category:count fraction`.

# SanDiego 

## Users
film:43 0.172
sports:206 0.824
music:1 0.004

## Tweets
film:13123 0.26803513071895424
sports:28805 0.5883374183006536
music:7032 0.14362745098039215

# NewYork

## Users
sports:123 0.7365269461077845
film:44 0.2634730538922156

## Tweets
film:9128 0.2795626473921166
music:4841 0.14826498422712933
sports:18682 0.572172368380754

# Denver

## Users
sports:168 0.9032258064516129
film:18 0.0967741935483871

## Tweets
sports:22391 0.6072464947251376
film:9513 0.25799365389309253
music:4969 0.13475985138176985
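
The fractions above are easier to compare as percentages. The sketch below recomputes the per-user shares from the counts listed above; a category that does not appear for a city (music in NewYork and Denver) is treated as zero users. The dictionary layout and the function name `shares` are ours.

```python
CATEGORIES = ('film', 'music', 'sports')

# user-level counts copied from the outputs above
user_counts = {
    'SanDiego': {'film': 43, 'sports': 206, 'music': 1},
    'NewYork': {'sports': 123, 'film': 44},
    'Denver': {'sports': 168, 'film': 18},
}

def shares(counts):
    """Return each category's share of the total, as a percentage."""
    total = sum(counts.values())
    return {cat: 100 * counts.get(cat, 0) / total for cat in CATEGORIES}

for city, counts in user_counts.items():
    row = ', '.join('{} {:.1f}%'.format(cat, pct)
                    for cat, pct in shares(counts).items())
    print('{}: {}'.format(city, row))
```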





In [90]:
# saving model
import pickle
with open('text_classifier', 'wb') as picklefile:  
    pickle.dump(classifier,picklefile)
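
The saved classifier expects tf-idf features, so to classify raw text later the fitted `vectorizer` and `tfidfconverter` have to be available as well. One option is to pickle all three together. The sketch below shows the round trip with hypothetical helper names (`save_bundle`, `load_bundle`) and a hypothetical file name; placeholder strings stand in for the fitted objects so that it runs on its own.

```python
import os
import pickle
import tempfile

def save_bundle(path, **objects):
    """Pickle several named objects into one file."""
    with open(path, 'wb') as f:
        pickle.dump(objects, f)

def load_bundle(path):
    """Load the dictionary written by save_bundle."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# placeholders; in the notebook these would be the fitted objects
bundle_path = os.path.join(tempfile.gettempdir(), 'text_classifier_bundle')
save_bundle(bundle_path, vectorizer='vectorizer',
            tfidfconverter='tfidfconverter', classifier='classifier')
restored = load_bundle(bundle_path)
print(sorted(restored))   # ['classifier', 'tfidfconverter', 'vectorizer']
```

As with any pickle file, only load bundles that come from a source you trust.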