# Frame the Problem and Look at the Big Picture 

1. Define the objective in business terms. 

    The business objective is to build a machine learning model that takes a moviegoer's review and classifies it as positive or negative. The client can use this to quickly gauge the overall reaction to a movie showing. 

2. How will your solution be used? 

    The solution will be used by the client to quickly determine reactions to movie showings. 

3. What are the current solutions/workarounds (if any)? 

    The current solution is to manually poll viewers to determine how they felt about the movie. This is a slow process, and the responses can be very mixed. 

4. How should you frame this problem (supervised/unsupervised, online/offline, ...)?

    This problem is a supervised offline sentiment analysis problem. 
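
To make the framing concrete, here is a minimal sketch of what "supervised" and "offline" mean in practice: the model learns from (text, label) pairs and is trained once on a fixed batch. The toy reviews and the choice of `MultinomialNB` are assumptions for illustration only; no classifier has been chosen yet in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labelled data (hypothetical): 1 = positive, 0 = negative
texts = [
    "loved it, wonderful film",
    "terrible, boring plot",
    "wonderful acting",
    "boring and terrible",
]
labels = [1, 0, 1, 0]

# Supervised: fit on (text, label) pairs.
# Offline: a single fit on a fixed batch, no incremental updates.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

preds = model.predict(["wonderful film", "boring plot"])
print(preds)
```

An online variant would instead update the model as new reviews arrive, which this project does not appear to need.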

5. How should performance be measured? Is the performance measure aligned with the business objective? 

    Our metric for this problem will be accuracy: the fraction of reviews whose sentiment the model classifies correctly. This aligns with the business objective, because the model's main job is to label each review's sentiment correctly.

6. What would be the minimum performance needed to reach the business objective? 

    The minimum performance required is 80% accuracy. 
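
As a rough sketch of how the 80% target could be checked, the snippet below computes accuracy with scikit-learn and compares it to the threshold. The helper name `meets_target` and the toy labels are assumptions made up for this example.

```python
from sklearn.metrics import accuracy_score

MIN_ACCURACY = 0.80  # minimum performance stated above


def meets_target(y_true, y_pred, threshold=MIN_ACCURACY):
    """Return (accuracy, passed) for a set of predictions."""
    acc = accuracy_score(y_true, y_pred)
    return acc, bool(acc >= threshold)


# Toy labels for illustration only (1 = positive, 0 = negative).
# Exactly one of the ten predictions is wrong.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0]

acc, passed = meets_target(y_true, y_pred)
print(f"accuracy={acc:.2f}, meets target: {passed}")
```

Accuracy is most informative when the two classes are roughly balanced, so it is worth checking the label distribution before relying on it.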

7. What are comparable problems? Can you reuse experience or tools? 

    A comparable problem was tackled by a YouTuber named Michael Reeves, who used sentiment analysis to determine whether Reddit posts on r/wallstreetbets were positive or negative. We can reuse the experience but not the tools he used. There are reusable tools in the textbook for this course that we will be using. 

8. Is human expertise available? 

    No human expertise is available at this time. 

9. How would you solve the problem manually? 

    To solve this problem manually, we would interview the moviegoers individually to determine their feelings about the movie. 

10. List the assumptions you (or others) have made so far. Verify assumptions if possible. 

    - A review can either be positive or negative. There are no "meh" reviews. 
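
One way to partially verify this assumption is to confirm that the label column contains only two values. The sketch below assumes a `label` column coded as 0/1, as in the dataset loaded later; the helper `check_binary_labels` and the toy frame are hypothetical. Note that this only confirms the dataset is labelled in a binary way; it cannot show that real reviews are never neutral.

```python
import pandas as pd


def check_binary_labels(df, label_col="label"):
    """Return the distinct labels and whether they contain only 0 and 1."""
    values = set(df[label_col].dropna().unique())
    return values, values <= {0, 1}


# Tiny stand-in frame; in the notebook this would be the loaded DataFrame
toy = pd.DataFrame(
    {"text": ["great", "awful", "fine"], "label": [1, 0, 1]}
)

values, ok = check_binary_labels(toy)
print(values, ok)

# Class balance matters when accuracy is the metric
print(toy["label"].value_counts(normalize=True))
```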

# Get the Data 
1. List the data you need and how much you need 
2. Find and document where you can get that data 
3. Get access authorizations 
4. Create a workspace (with enough storage space) 
5. Get the data 
6. Convert the data to a format you can easily manipulate (without changing the data itself) 
7. Ensure sensitive information is deleted or protected (e.g. anonymized) 
8. Check the size and type of data (time series, geographical, ...) 
9. Sample a test set, put it aside, and never look at it (no data snooping!) 
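
Step 9 can be sketched as follows. This is a minimal example using `train_test_split`, assuming a DataFrame with `text` and `label` columns; the 80/20 split, the seed, and the helper name `split_reviews` are illustrative choices, not decisions made in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_reviews(df, test_size=0.2, seed=42):
    """Split into train/test sets, keeping the label ratio similar in both."""
    train_df, test_df = train_test_split(
        df,
        test_size=test_size,
        random_state=seed,     # fixed seed so the split is reproducible
        stratify=df["label"],  # preserve the positive/negative ratio
    )
    return train_df, test_df


# Toy frame standing in for the real data
toy = pd.DataFrame(
    {"text": [f"review {i}" for i in range(20)], "label": [0, 1] * 10}
)
train_df, test_df = split_reviews(toy)
print(len(train_df), len(test_df))
```

Any vectorizer should then be fitted on the training portion only, so that vocabulary statistics from the test set do not leak into the model.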

In [1]:
import pandas as pd 
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.tokenize import RegexpTokenizer

In [2]:
file = pd.read_csv("movie.csv")

In [3]:
file

Unnamed: 0,text,label
0,I grew up (b. 1965) watching and loving the Th...,0
1,"When I put this movie in my DVD player, and sa...",0
2,Why do people who do not know what a particula...,0
3,Even though I have great interest in Biblical ...,0
4,Im a die hard Dads Army fan and nothing will e...,1
...,...,...
39995,"""Western Union"" is something of a forgotten cl...",1
39996,This movie is an incredible piece of work. It ...,1
39997,My wife and I watched this movie because we pl...,0
39998,"When I first watched Flatliners, I was amazed....",1


In [4]:
file.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40000 entries, 0 to 39999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    40000 non-null  object
 1   label   40000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 625.1+ KB


In [17]:
CountVec = CountVectorizer(ngram_range=(1, 1),  # unigrams only; use ngram_range=(2, 2) for bigrams
                           stop_words='english')  # drop common English stop words


CountVec.fit(list(file.text))

CountVectorizer(stop_words='english')

In [18]:
vector = CountVec.transform(list(file.text))

In [20]:
print(vector.shape)
print(type(vector))
print(vector.toarray())

(40000, 92598)
<class 'scipy.sparse.csr.csr_matrix'>
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 2 0 ... 0 0 0]]
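
The shape above is (number of reviews, vocabulary size), and each entry counts how often a term occurs in a review. A small sketch on a made-up three-document corpus shows this, along with the effect of `ngram_range`; the toy documents are assumptions for illustration, and stop-word removal is left off to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good movie", "bad movie", "good good acting"]

# Unigrams: one column per distinct word
unigrams = CountVectorizer(ngram_range=(1, 1))
X_uni = unigrams.fit_transform(docs)
print(sorted(unigrams.vocabulary_))  # column order is alphabetical
print(X_uni.toarray())               # safe to densify: the matrix is tiny

# Bigrams: one column per distinct pair of adjacent words
bigrams = CountVectorizer(ngram_range=(2, 2))
X_bi = bigrams.fit_transform(docs)
print(sorted(bigrams.vocabulary_))
print(X_bi.shape)
```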


In [23]:
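# Count the tokens kept for the first review (stop words excluded).
# Summing the sparse row directly avoids densifying the whole matrix
# and avoids shadowing the built-in sum().
first_review_total = vector[0].sum()
first_review_total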

76
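
Calling `toarray()` on the full matrix is expensive: with 40000 rows, 92598 columns, and 8-byte integers, the dense array needs roughly 29.6 GB. The sketch below works through that arithmetic and computes per-row totals on a sparse matrix without densifying it. The helper `dense_size_gb` and the toy matrix are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix


def dense_size_gb(n_rows, n_cols, itemsize=8):
    """Approximate size in GB (10**9 bytes) of a dense array."""
    return n_rows * n_cols * itemsize / 1e9


print(f"{dense_size_gb(40000, 92598):.1f} GB")

# Toy count matrix standing in for the vectorizer output
counts = csr_matrix(
    np.array([[0, 2, 1, 0], [0, 0, 0, 3], [1, 0, 0, 0]], dtype=np.int64)
)

# Row sums computed on the sparse matrix itself
row_totals = np.asarray(counts.sum(axis=1)).ravel()
print(row_totals)

# Bytes needed if the toy matrix were stored densely
dense_bytes = counts.shape[0] * counts.shape[1] * counts.dtype.itemsize
print(dense_bytes)
```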

In [17]:
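# Raw string avoids an invalid-escape warning for \w.
# The pattern keeps runs of word characters and apostrophes as tokens.
tokenizer = RegexpTokenizer(r"[\w']+")
list_of_reviews = [tokenizer.tokenize(text) for text in file.text]
list_of_reviews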


In [18]:
list_of_reviews[0]

['I',
 'grew',
 'up',
 'b',
 '1965',
 'watching',
 'and',
 'loving',
 'the',
 'Thunderbirds',
 'All',
 'my',
 'mates',
 'at',
 'school',
 'watched',
 'We',
 'played',
 'Thunderbirds',
 'before',
 'school',
 'during',
 'lunch',
 'and',
 'after',
 'school',
 'We',
 'all',
 'wanted',
 'to',
 'be',
 'Virgil',
 'or',
 'Scott',
 'No',
 'one',
 'wanted',
 'to',
 'be',
 'Alan',
 'Counting',
 'down',
 'from',
 '5',
 'became',
 'an',
 'art',
 'form',
 'I',
 'took',
 'my',
 'children',
 'to',
 'see',
 'the',
 'movie',
 'hoping',
 'they',
 'would',
 'get',
 'a',
 'glimpse',
 'of',
 'what',
 'I',
 'loved',
 'as',
 'a',
 'child',
 'How',
 'bitterly',
 'disappointing',
 'The',
 'only',
 'high',
 'point',
 'was',
 'the',
 'snappy',
 'theme',
 'tune',
 'Not',
 'that',
 'it',
 'could',
 'compare',
 'with',
 'the',
 'original',
 'score',
 'of',
 'the',
 'Thunderbirds',
 'Thankfully',
 'early',
 'Saturday',
 'mornings',
 'one',
 'television',
 'channel',
 'still',
 'plays',
 'reruns',
 'of',
 'the',
 'ser