<H1><center>Project 4<BR><BR>
Recommender Systems</center></H1>

<H2>Task 1: Algorithmic Bias
</H2>

<P>Read <a href="https://cacm.acm.org/magazines/2016/10/207759-battling-algorithmic-bias/abstract">this article</a> (also available <a href="https://drive.google.com/file/d/1bRKH--BoCAQw7ZyK8xmBGGZR2lyPlRaN/view?usp=sharing">here</a>) about bias in algorithms.</P>

<P>Please address (minimum 200 words) the following questions in the space below. As the article describes, "common wisdom among programmers is to develop a pure algorithm that does not incorporate protected attributes into the model," where protected classes may include aspects such as race, gender, age, sexual orientation, and disability status. As a result, the algorithms can be bias. How might algorithms be evaluated for bias? Are there machine learning applications where you would sacrifice accuracy in order to reduce bias? Are there machine learning applications where you would allow bias in order to increase accuracy?</P>

To detect bias in an algorithm, you could look for patterns in the results and determine how those patterns were created. Since biases are inherently harmful and prejudiced, it is hard to imagine a case where we might allow bias in an algorithm. This would have to be a case where we want an algorithm to make an unfair judgment. I could see how this may be useful in some sort of academic study. But only if that study needs these biases to be present in order to draw conclusions. To prevent bias in algorithms, it may be good to try and eliminate bias at the source. So if the algorithm is using data from a sample, we should ensure that that data was collected properly and fairly. We can also evaluate the results of the algorithm to ensure that if the features that could lead to bias (race, gender, sexual-orientation, etc.) were changed, the algorithm’s output would remain the same. 



<H2>Task 2: Pipeline for Classification
</H2>

<P>As broader context for this project, little or no starter code is provided for the tasks in the project. One of the goals of this lack of starter code is to give you an opportunity to practice developing meaningful computational artifacts mostly or entirely from scratch.</P>

<P>To start, let's make a simple pipeline that integrates in one function some of the aspects of supervised classification.</P>

<P>Create a function named <code>pipeline</code> that takes three arguments:</P>
<ol>
    <li>The name of a comma separated values (CSV) file. Assume the file has a header row and all subsequent rows contain numerical data. The last column in the file corresponds to classification labels.</li>
    <li>A Boolean indicating whether feature scaling should be performed on the data in the file.</li>
    <li>A supervised classification model, such as a random forest classifier, a <em>k</em> nearest neighbors classifier, a logistic regression classifier, or a support vector machine.</li>
</ol>

<P>Your <code>pipeline</code> function should:</P>
<ol>
    <li>read in the specified file</li>
    <li>extract the feature vectors (e.g., <em>X</em>) and labels (e.g., <em>y</em>)</li>
    <li>split the data into training (80%) and testing (20%) (with <code>random_state=0</code>)</li>
    <li>perform feature scaling <em>if</em> the specified Boolean argument is <code>True</code></li>
    <li>train the classification model on the training data</li>
    <li>print out the accuracy of the classification model on the testing data</li>
</ol>


In [74]:
# The *pipeline* function executes some of the aspects of supervised
# classification on a set of data.
# The function takes three arguments: the name of a CSV file with a header line and
# whose last column contains labels, a Boolean indicating if feature scaling
# should be performed on the data in the file, and a supervised classification model.
# The function reads in data from the file, splits the data into training (80%)
# and testing (20%), performs feature scaling if specified by the Boolean argument,
# trains the model, and prints out the model's accuracy on the testing data.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import tree
from sklearn import neighbors #knn
from sklearn import linear_model #perceptron and logistic regression
from sklearn import ensemble #random forest
from sklearn import svm

import numpy as np

learners = {'Perceptron': linear_model.Perceptron(max_iter=10),
            'RandomForest': ensemble.RandomForestClassifier(),
            'kNN': neighbors.KNeighborsClassifier(), 
            'logistic': linear_model.LogisticRegression(random_state=0),
            'SVM':svm.SVC()
           }

#for classifierName in learners:
    #learners[classifierName].fit(X_train, y_train)
    #print('Accuracy of ' + classifierName + ':\t' + str(learners[classifierName].score(X_test, y_test)))

#decision tree
#DecisionTreeClassifier = tree.DecisionTreeClassifier() 
#RandomForestClassifier = ensemble.RandomForestClassifier()

#knn
#KNeighborsClassifier = neighbors.KNeighborsClassifier()

#perceptron
#Perceptron = linear_model.Perceptron(max_iter=10, random_state=0)

#logistic
#LogisticRegression = linear_model.LogisticRegression(random_state=0)

#svm 
#SupportVectorMachine = svm.SVC()

def pipeline(csv_file, shouldScale, model):
        
    #model = KNeighborsClassifier()
    
    DATA = np.loadtxt(csv_file, delimiter=',', skiprows=1)
    X = DATA[:, :-1]
    y = DATA[:, -1] 
      
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
    
    if(shouldScale):
        scaler = StandardScaler()
        X_trainScaled = scaler.fit_transform(X_train)
        learners[model].fit(X_trainScaled, y_train)
        result = learners[model].score(X_test, y_test)
        #result2 = model.score(X_train, y_train)
    else:
        learners[model].fit(X_train, y_train)
        result = learners[model].score(X_test, y_test)

    print('score on ' + str(csv_file) + ' data using ' + str(model) + ':\t' + str(result))
    #print('score on train data using ' + str(model) + ':\t' + str(result2))
    
        

<P>With the <code>pipeline</code> function above, you have a mechanism for quickly invoking several aspects of supervised classification. Below, you should experiment with executing your <code>pipeline</code> function on the following four datasets:</P>
<ol>
    <li>The file <code>admission.csv</code> contains GPA and GMAT score data on people who have/haven't been <a href="https://www.kaggle.com/willianleite/mba-admission">admitted to an MBA program</a> (graduate school in business). How well can you predict who will be admitted based on their GPA and their GMAT standardized test score?</li>
    <li>The file <code>dementia.csv</code> contains MRI scans and other health data on people who do/don't have <a href="https://www.kaggle.com/shashwatwork/dementia-prediction-dataset">dementia</a>. How well can you predict who has dementia based on this health data?</li>
    <li>The file <code>star.csv</code> contains chromaticity and spectral data for different stars in the galaxy. In this <a href="https://www.kaggle.com/vinesmsuic/star-categorization-giants-and-dwarfs">stellar classification problem</a>, how well can you determine which stars are "giants" and which are "dwarfs".</li>
    <li>The file <code>airlineDelays.csv</code> contains data on airplane flights and whether or not the flight was delayed. Data include information on the airline, the departing airport of the flight, the weather, the number of seats on the flight, the number of flight attendants, the age of the plane, etc. How well can you predict <a href="https://www.kaggle.com/threnjen/2019-airline-delays-and-cancellations">flight delays</a>?</li>
</ol>

In [77]:
# Experiments executing *pipeline* on four different data sets

pipeline('admission.csv',False, 'RandomForest')
#pipeline('admission.csv',False, RandomForestClassifier)
pipeline('dementia.csv',True, 'SVM')
pipeline('star.csv',True, 'kNN')
pipeline('airlineDelays.csv',True, 'logistic')

#bad results? Perceptron and Logistic have exact same score??? weird




score on admission.csv data using RandomForest:	1.0
score on dementia.csv data using SVM:	0.49333333333333335
score on star.csv data using kNN:	0.5015
score on airlineDelays.csv data using logistic:	0.494


<P><font color="maroon"><u>For the <code>admission</code> data, what is the testing accuracy <code>without</code> feature scaling and using a <code>Random Forest</code> classifier (with <code>random_state=0</code>)?<u></font></P>

100%

<P><font color="maroon"><u>For the <code>dementia</code> data, what is the testing accuracy <code>with</code> feature scaling and using a <code>Support Vector Machine</code> (with <code>random_state=0</code>)?<u></font></P>

Your answer here.

<P><font color="maroon"><u>For the <code>star</code> data, what is the testing accuracy <code>with</code> feature scaling and using a <code><em>k</em> Nearest Neighbors</code> classifier?<u></font></P>

Your answer here.

<P><font color="maroon"><u>For the <code>airline delays</code> data, what is the testing accuracy <code>with</code> feature scaling and using a <code>Logistic Regression</code> classifier (with <code>random_state=0</code>)?<u></font></P>

Your answer here.

<H2>Task 3: Sentiment Analysis of Movie Reviews
</H2>

<P>Remember in Exercise 3 where you performed sentiment analysis of Twitter data? If not, reviewing it will be very helpful here :). Here you will perform sentiment analysis of <a href="https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews">movie reviews</a>, i.e., you will determine whether a movie review is negative or positive based on the text of the review.</P>
<P>The file <code>movie_reviews.txt</code> contains 50,000 lines. Each line contains a review of a movie followed by a label, either "@@@negative" or "@@@positive", indicating if the review is negative or positive. Your task is to:</P>
<ol>
    <li>read in the data from the file</li>
    <li>randomly shuffle the data, while making sure each label stays with its corresponding review</li>
    <li>store the data as a corpus and as labels</li>
    <li>split the data into training (80%) and testing (20%)</li>
    <li>use the Bag of Words approach (tf-idf) to vectorize the data (you should use unigrams rather than bigrams here)</li>
    <li>use a logistic regression classifier to predict the sentiment of reviews</li>
    <li>based on the logistic regression model weights, determine the 10 lowest weighted words (indicative of negative sentiment reviews) and the 10 highest weighted words (indicative of positive sentiment reviews)</li>
</ol>

In [2]:
# Read in movie review data from file.
# Randomly shuffle data, making sure each label stays with its corresponding review.
# Store data as corpus and labels.
import random
random.seed(42)
import numpy as np

def readMovieFile(txt_fileName):
    DATA = []
    txt_file = open(txt_fileName, encoding="utf-8")
    reviews = txt_file.read().split('\n')

    for review in reviews: 
        reviewSplit = review.split('@@@') #review is an array with two strings: one with the review message and the other with the word positive/negative
        justReview = reviewSplit[0]
        justSentiment = reviewSplit[1]
        DATA.append([justReview, justSentiment])
        
    txt_file.close()

    random.shuffle(DATA)
    return [row[0] for row in DATA], [row[1] for row in DATA]

corpus, labels = readMovieFile('movie_reviews.txt')

In [3]:

TEST_SIZE = 0.2  # 20% testing data and 80% training data

separator = int((1.0 - TEST_SIZE)*len(corpus))
corpus_train = corpus[:separator]
labels_train = labels[:separator]
corpus_test = corpus[separator:]
labels_test = labels[separator:]
print(len(corpus_train))
print(len(labels_train))
print(len(corpus_test))
print(len(labels_test))

40000
40000
10000
10000


In [5]:
# Text feature extraction (using Bag of Words approach with unigrams) for training data

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1,1))
X_train = vectorizer.fit_transform(corpus_train)
y_train = np.array(labels_train)
print(X_train.shape)
print(y_train.shape)

(40000, 93110)
(40000,)


In [6]:
# Text feature extraction (using Bag of Words approach with unigrams) for testing data

X_test = vectorizer.transform(corpus_test)
y_test = np.array(labels_test)
print(X_test.shape)
print(y_test.shape)

(10000, 93110)
(10000,)


In [7]:
# Use logistic regression to predict sentiment of reviews (with random_state=0)
from sklearn import linear_model

logistic = linear_model.LogisticRegression(random_state=0)
logistic.fit(X_train, y_train)
score = logistic.score(X_test, y_test)

print("Score of Logistic Regression on review sentiment: " + str(score))

Score of Logistic Regression on review sentiment: 0.8988


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [28]:
# Get Logistic Regression weights

learner = linear_model.LogisticRegression(random_state=0)
learner.fit(X_train, y_train)
weights = learner.coef_ 
sorted_weights = np.argsort(weights)  # Sort
features = vectorizer.get_feature_names()  # Get the features
print('\nLowest weighted words (indicative of negative sentiment reviews)')
for i in range(10): print(features[sorted_weights[0,i]])
print('\nHighest weighted words (indicative of positive sentiment reviews)')
for i in range(10): print(features[sorted_weights[0,len(sorted_weights[0])-i-1]])


Lowest weighted words (indicative of negative sentiment reviews)
worst
bad
waste
awful
boring
terrible
poor
nothing
horrible
worse

Highest weighted words (indicative of positive sentiment reviews)
great
excellent
best
perfect
wonderful
today
loved
amazing
hilarious
fun




<P><font color="maroon"><u>What is the testing accuracy of your logistic regression model at determining the sentiment of movie reviews?<u></font></P>

89.88%

<P><font color="maroon"><u>For your logistic regression classifier, what are the 10 lowest weighted words, indicative of negative reviews?<u></font></P>

worst
bad
waste
awful
boring
terrible
poor
nothing
horrible
worse

<P><font color="maroon"><u>For your logistic regression classifier, what are the 10 highest weighted words, indicative of positive reviews?<u></font></P>

great
excellent
best
perfect
wonderful
today
loved
amazing
hilarious
fun

<H2>Task 4: Detecting Fake News
</H2>

<P>Can you use machine learning to distinguish <a href="https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset">fake news stories from true news stories</a>? You are provided two files, each containing over 20,000 lines, where each line corresponds to a news story. The file <code>Fake_News.txt</code> contains fake news stories. The file <code>True_News.txt</code> contains true news stories. Using the Bag of Words (tf-idf) approach, your task is to distinguish fake news stories from true news stories. We are not providing much guidance here (and no starter code) so that you can think about how best to tackle this problem largely from scratch.</P>

In [8]:
#tokenize  returns a list of words(tokens) which was created from the input
def tokenize(text):
    return text.replace('!', ' ').replace('.', ' ').replace('?', ' ').replace(',', ' ').lower().split()

# Test the tokenize function
#tokenize('What is zero divided by zero?')

In [10]:
# Your code here. 
# Rather than put all your code in this one cell, create multiple new cells as needed.

#read in file
import random
random.seed(42)
def readNewsData(fake_filename, true_filename):
    DATA = []

    # Read in fake messages from file
    fake_file = open(fake_filename, encoding="utf-8") #why do we need this encoding line???
    fakeNewsStories = fake_file.read().split('\n')
    for fakeStory in fakeNewsStories: 
        DATA.append([fakeStory, 0])
    fake_file.close()
    
    # Read in true messages from file
    true_file = open(true_filename,  encoding="utf-8")
    trueNewsStories= true_file.read().split('\n')
    for trueStory in trueNewsStories: 
        DATA.append([trueStory, 1])
    true_file.close()

    random.shuffle(DATA)  # Shuffle
    return [row[0] for row in DATA], [row[1] for row in DATA]

corpus, labels = readNewsData('Fake_News.txt', 'True_News.txt')



In [11]:
# Separate into training and testing data
TEST_SIZE = 0.2  # 20% testing data and 80% training data

separator = int((1.0 - TEST_SIZE)*len(corpus))
corpus_train = corpus[:separator]
labels_train = labels[:separator]

corpus_test = corpus[separator:]
labels_test = labels[separator:]

In [12]:
#vectorizing the training data, i.e., use the Bag of Words approach to count how many times each word appears in each news story
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1,2))  # Use individual words as tokens
X_train = vectorizer.fit_transform(corpus_train)
y_train = np.array(labels_train)
print(X_train.shape)
print(y_train.shape)

(35407, 2541612)
(35407,)


In [13]:
#now the testing data
X_test = vectorizer.transform(corpus_test)
y_test = np.array(labels_test)

print(X_test.shape)
print(y_test.shape)

(8852, 2541612)
(8852,)


In [14]:
# Use different classifiers to predict if a news story is real or fake
from sklearn import linear_model
from sklearn import neighbors
from sklearn import ensemble
from sklearn import svm
"""
c = [0.1, 1.0, 10.0, 100.0, 1000.0]
gamma = [0.1,1.0, 10.0, 100.0, 1000.0]

for val in c:
    for value in gamma:
        vector = svm.SVC(C=val, gamma = value, max_iter = 10)
        vector.fit(X_train, y_train)
        score = vector.score(X_test, y_test)
        print("accuracy with c:" + str(val) + "and gamma:" + str(value) + "is:" + str(score))
"""
learners = {'Perceptron': linear_model.Perceptron(max_iter=10),
            #'RandomForest': ensemble.RandomForestClassifier(),
            'kNN': neighbors.KNeighborsClassifier(), 
            'logistic': linear_model.LogisticRegression(random_state=0),
            'SVM':svm.SVC(C=10.0, gamma=1.0, max_iter = 20)
           }
for classifierName in learners:
    learners[classifierName].fit(X_train, y_train)
    print('Accuracy of ' + classifierName + ':\t' + str(learners[classifierName].score(X_test, y_test)))

Accuracy of Perceptron:	0.9915273384545865
Accuracy of kNN:	0.8846588341617714
Accuracy of logistic:	0.9826028016267511




Accuracy of SVM:	0.7007455942159964


In [15]:
# Get Perceptron weights to figure out what features were weighted the highest and lowest for classification. 
learner = linear_model.Perceptron(max_iter=10)
learner.fit(X_train, y_train)

weights = learner.coef_  # Get the learned Perceptron weights
sorted_weights = np.argsort(weights)  # Sort
features = vectorizer.get_feature_names()  # Get the features

print('\nLowest weighted words (indicative of fake news stories)')
for i in range(10): print(features[sorted_weights[0,i]])
print('\nHighest weighted words (indicative of true news stories)')
for i in range(10): print(features[sorted_weights[0,len(sorted_weights[0])-i-1]])




Lowest weighted words (indicative of fake news stories)
via
read more
president trump
sen
read
this
mr
featured image
com
featured

Highest weighted words (indicative of true news stories)
said
on
said on
president donald
said in
on monday
president barack
republican
showed
reported on


<P><font color="maroon"><u>What is the testing accuracy of your machine learning method in determining whether a news story is fake?<u></font></P>

99.15%

<P><font color="maroon"><u>What are 10 words highly associated with true news stories?<u></font></P>

said
on
said on
president donald
said in
on monday
president barack
republican
showed
reported on

<P><font color="maroon"><u>What are 10 words highly associated with fake news stories?<u></font></P>

via
read more
president trump
sen
read
this
mr
featured image
com
featured

<H2>Submitting your work
</H2>

<P><font color="maroon"><u>Please indicate your name and the names of any partner that worked with you on this project:</u></font></P>

Name(s): Kate MacVicar and Hope Zhu

<P><font color="maroon"><u>Please indicate anyone else that you collaborated with in the process of doing the project:</u></font></P>

Collaborators: 

<P><font color="maroon"><u>When working on this project, approximately how many hours did you spend on each of (1) Task 1, (2) Task 2, (3) Task 3, (4) Task 4, (5) Task 5, and (6) Total?</u></font></P>

Hours on Task 1: 
Hours on Task 2: 
Hours on Task 3: 
Hours on Task 4: 
Hours on Task 5: 
Total hours: 

<P><font color="maroon"><u>When working on this project, did you abide by the <a href="https://www.wellesley.edu/studentlife/aboutus/honor">Honor Code</a> and is all of the work that you are submitting your own and/or your partner's?</u></font></P>

Abide by Honor Code: yes

<P><font color="maroon"><u>To submit this project, please upload your <code>Project4.ipynb</code> file to the <code>Project4</code> folder that the instructor created and shared with you in your Google drive.</u></font></P>