### Student Information
Name: Orison

Student ID: 106065425

Dataset Used: Sentiment Labelled Sentences Data Set (URL: https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences)

### Instructions

- Download the dataset provided in this [link](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences#). The sentiment dataset contains a `sentence` and `score` label. Read what the dataset is about on the link provided before you start exploring it. 


- Then, you are asked to apply each of the data exploration and data operation techniques learned in the [first lab session](https://goo.gl/Sg4FS1) on the new dataset. You don't need to explain all the procedures as we did in the notebook, but you are expected to provide some **minimal comments** explaining your code. You are also expected to use the same libraries used in the first lab session. You are allowed to use and modify the `helper` functions we provided in the first lab session or create your own. Also, be aware that the helper functions may need modification as you are dealing with a completely different dataset. This part is worth 80% of your grade!


- After you have completed the operations, you should attempt the **bonus exercises** provided in the [notebook](https://goo.gl/Sg4FS1) we used for the first lab session. There are six (6) additional exercises; attempt them all, as they are part of your grade (10%).


- You are also expected to tidy up your notebook and attempt new data operations that you have learned so far in the Data Mining course. Surprise us! This segment is worth 10% of your grade.


- After completing all the above tasks, you are free to remove this header block and submit your assignment following the guide provided in the `README.md` file of the assignment's [repository](https://github.com/omarsar/data_mining_hw_1). 

# Dataset Information

Sentiment Labelled Sentences

This dataset was created for the paper 'From Group to Individual Labels using Deep Features'.

Each sentence carries a score of 1 (positive) or 0 (negative). The sentences were selected from three different websites, and neutral sentences were deliberately left out.



# 0. Importing Libraries

In [1]:
# reload external scripts automatically whenever they change
%load_ext autoreload
%autoreload 2

In [18]:
import pandas as pd
import numpy as np
import nltk


from sklearn.datasets import fetch_20newsgroups
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
import plotly.plotly as py
import plotly.graph_objs as go
import math
%matplotlib inline

# my functions
import helpers.data_mining_helpers as dmh
import helpers.text_analysis as ta

# 1. Data Source

# 2. Preparing the Data

First, let's load the IMDb file and check that it reads correctly.

In [127]:
# tab-separated file; quoting=3 (csv.QUOTE_NONE) keeps quote characters as part of the text
imdb = pd.read_csv("data/imdb_labelled.txt", header=0, sep="\t", quoting=3, names=["Sentence", "Score"])


In [128]:
print(imdb)

                                              Sentence  Score
0    Not sure who was more lost - the flat characte...      0
1    Attempting artiness with black & white and cle...      0
2         Very little music or anything to speak of.        0
3    The best scene in the movie was when Gerardo i...      1
4    The rest of the movie lacks art, charm, meanin...      0
5                                  Wasted two hours.        0
6    Saw the movie today and thought it was a good ...      1
7                                 A bit predictable.        0
8    Loved the casting of Jimmy Buffet as the scien...      1
9                 And those baby owls were adorable.        1
10   The movie showed a lot of Florida at it's best...      1
11   The Songs Were The Best And The Muppets Were S...      1
12                                   It Was So Cool.        1
13   This is a very "right on case" movie that deli...      1
14   It had some average acting from the main perso...      0
15   Thi

In [106]:
len(imdb)

999

In [107]:
print(imdb.shape)

(999, 2)
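
The dataset page describes 500 positive and 500 negative sentences per website, while the frame above has 999 rows. One possible cause, assuming the file has no header line, is that `header=0` combined with `names=` makes pandas treat the first line as a header and replace it. A minimal sketch of the difference, using a made-up two-line `sample` string rather than the real file:

```python
import io
import pandas as pd

# hypothetical two-line sample in the same tab-separated layout
sample = "Good movie.\t1\nBad movie.\t0\n"

# header=0 + names: the first line is consumed as a header and replaced
with_header = pd.read_csv(io.StringIO(sample), sep="\t", quoting=3,
                          header=0, names=["Sentence", "Score"])

# header=None + names: every line is kept as data
no_header = pd.read_csv(io.StringIO(sample), sep="\t", quoting=3,
                        header=None, names=["Sentence", "Score"])

print(len(with_header), len(no_header))
```

If the real file behaves the same way, loading it with `header=None` would keep the first sentence as row 0.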


In [108]:
imdb[0:10]

|     | Sentence | Score |
| --- | --- | --- |
| 0 | Not sure who was more lost - the flat characte... | 0 |
| 1 | Attempting artiness with black & white and cle... | 0 |
| 2 | Very little music or anything to speak of. | 0 |
| 3 | The best scene in the movie was when Gerardo i... | 1 |
| 4 | The rest of the movie lacks art, charm, meanin... | 0 |
| 5 | Wasted two hours. | 0 |
| 6 | Saw the movie today and thought it was a good ... | 1 |
| 7 | A bit predictable. | 0 |
| 8 | Loved the casting of Jimmy Buffet as the scien... | 1 |
| 9 | And those baby owls were adorable. | 1 |


In [111]:
imdb.sample(n=25)

|     | Sentence | Score |
| --- | --- | --- |
| 41 | Not only did it only confirm that the film wou... | 0 |
| 767 | PS the only scene in the movie that was cool i... | 1 |
| 914 | I didn't realize how wonderful the short reall... | 1 |
| 10 | The movie showed a lot of Florida at it's best... | 1 |
| 890 | Now this is a movie I really dislike. | 0 |
| 330 | For those that haven't seen it, don't waste yo... | 0 |
| 329 | The hockey scenes are terrible, defensemen pla... | 0 |
| 986 | This movie is well-balanced with comedy and dr... | 1 |
| 467 | I knew when I saw the film that more great thi... | 1 |
| 805 | The worst one of the series. | 0 |


In [114]:
print(imdb.Sentence[0])

Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out.  


In [113]:
print(imdb.Score[0])

0


In [126]:
imdb.Score[:10]

0    0
1    0
2    0
3    1
4    0
5    0
6    1
7    0
8    1
9    1
Name: Score, dtype: int64

# 3. Checking the Data with Pandas

In [187]:
imdb['Score']

0      0
1      0
2      0
3      1
4      0
5      0
6      1
7      0
8      1
9      1
10     1
11     1
12     1
13     1
14     0
15     1
16     1
17     1
18     1
19     1
20     1
21     1
22     1
23     1
24     0
25     0
26     1
27     1
28     1
29     1
      ..
969    1
970    1
971    0
972    0
973    0
974    1
975    1
976    0
977    1
978    1
979    1
980    1
981    1
982    1
983    1
984    1
985    1
986    1
987    1
988    1
989    1
990    1
991    1
992    1
993    0
994    0
995    0
996    0
997    0
998    0
Name: Score, Length: 999, dtype: int64
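
Printing the whole column shows the labels but not how they are distributed. A compact alternative is `value_counts()`, sketched below on a small stand-in frame (`demo` is hypothetical and only mimics the layout of `imdb`):

```python
import pandas as pd

# hypothetical stand-in for the `imdb` frame loaded earlier
demo = pd.DataFrame({
    "Sentence": ["Loved it.", "Wasted two hours.", "It was so cool.", "A bit predictable."],
    "Score": [1, 0, 1, 0],
})

# count how many sentences fall under each label
counts = demo["Score"].value_counts()
print(counts)
```

Running `imdb["Score"].value_counts()` in the same way would report the number of positive and negative sentences in the loaded file.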

In [168]:
X = pd.DataFrame.from_records(dmh.format_rows(imdb), columns= ['Scores'])

In [167]:
len(X)

2
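
A length of 2 matches the number of columns in `imdb`, not its 999 rows. The body of `dmh.format_rows` is not shown here, so the cause is an assumption, but a plain `for` loop over a DataFrame yields column labels rather than rows, which would produce exactly two records. A sketch of the difference on a small stand-in frame (`demo` is hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({"Sentence": ["Loved it.", "Hated it.", "It was fine."],
                     "Score": [1, 0, 1]})

# iterating over the frame itself walks the column labels
by_column = [item for item in demo]

# iterating over one column walks its values, one per row
by_row = [[text.strip()] for text in demo["Sentence"]]

print(by_column)    # the column labels
print(len(by_row))  # one entry per row

# one-column frame built from the per-row records
rows = pd.DataFrame.from_records(by_row, columns=["Sentence"])
```

If the helper loops over its argument directly, passing a single column such as `imdb["Sentence"]` (or adapting the helper to do so) would give one record per sentence.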

In [172]:
imdb[0:10][["Sentence","Score"]]

|     | Sentence | Score |
| --- | --- | --- |
| 0 | Not sure who was more lost - the flat characte... | 0 |
| 1 | Attempting artiness with black & white and cle... | 0 |
| 2 | Very little music or anything to speak of. | 0 |
| 3 | The best scene in the movie was when Gerardo i... | 1 |
| 4 | The rest of the movie lacks art, charm, meanin... | 0 |
| 5 | Wasted two hours. | 0 |
| 6 | Saw the movie today and thought it was a good ... | 1 |
| 7 | A bit predictable. | 0 |
| 8 | Loved the casting of Jimmy Buffet as the scien... | 1 |
| 9 | And those baby owls were adorable. | 1 |


In [173]:
imdb[-11:-1]

|     | Sentence | Score |
| --- | --- | --- |
| 988 | :) Anyway, the plot flowed smoothly and the ma... | 1 |
| 989 | The opening sequence of this gem is a classic,... | 1 |
| 990 | Fans of the genre will be in heaven. | 1 |
| 991 | Lange had become a great actress. | 1 |
| 992 | It looked like a wonderful story. | 1 |
| 993 | I never walked out of a movie faster. | 0 |
| 994 | I just got bored watching Jessice Lange take h... | 0 |
| 995 | Unfortunately, any virtue in this film's produ... | 0 |
| 996 | In a word, it is embarrassing. | 0 |
| 997 | Exceptionally bad! | 0 |
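
Note that the slice `[-11:-1]` stops before the final row, so index 998 does not appear in the output above. If the last ten rows are wanted, `[-10:]` or `tail(10)` includes it. A small sketch on a stand-in frame (`demo` is hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({"Score": range(20)})

almost_last = demo[-11:-1]    # ten rows, ending one before the last
last_ten = demo[-10:]         # ten rows, including the last
same_as_tail = demo.tail(10)  # equivalent to the line above

print(almost_last.index.max(), last_ten.index.max())
```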


In [174]:
imdb.iloc[::10, :][0:10]

|     | Sentence | Score |
| --- | --- | --- |
| 0 | Not sure who was more lost - the flat characte... | 0 |
| 10 | The movie showed a lot of Florida at it's best... | 1 |
| 20 | In other words, the content level of this film... | 1 |
| 30 | Waste your money on this game. | 1 |
| 40 | I wasn't the least bit interested. | 0 |
| 50 | In addition to having one of the most lovely s... | 1 |
| 60 | All in all I give this one a resounding 9 out ... | 1 |
| 70 | Often the dialogue doesn't really follow from ... | 0 |
| 80 | This if the first movie I've given a 10 to in ... | 1 |
| 90 | The problem was the script. | 0 |


# 4. Data Mining

In [135]:
imdb.isnull()

|     | Sentence | Score |
| --- | --- | --- |
| 0 | False | False |
| 1 | False | False |
| 2 | False | False |
| 3 | False | False |
| 4 | False | False |
| 5 | False | False |
| 6 | False | False |
| 7 | False | False |
| 8 | False | False |
| 9 | False | False |


In [136]:
# count missing values per column with the helper
imdb.isnull().apply(dmh.check_missing_values)

Sentence    (The amoung of missing records is: , 0)
Score       (The amoung of missing records is: , 0)
dtype: object
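
For reference, the same per-column count can be obtained with pandas alone, without the helper. A minimal sketch on a stand-in frame with one deliberately missing score (`demo` is hypothetical):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"Sentence": ["Loved it.", "Hated it."],
                     "Score": [1, np.nan]})

# True counts as 1 when summed, so this gives missing values per column
missing_per_column = demo.isnull().sum()
print(missing_per_column)
```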

In [161]:
dummy = pd.Series(["dummy_record", ""], index=["Sentence", "Score"])

In [162]:
dummy 

Sentence    dummy_record
Score                   
dtype: object

In [163]:
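# DataFrame.append was removed in pandas 2.0; pd.concat appends the record the same way
result_with_series = pd.concat([imdb, dummy.to_frame().T], ignore_index=True)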

In [164]:
len(result_with_series)

1000

In [165]:
result_with_series.isnull().apply(lambda x: dmh.check_missing_values(x))

Sentence    (The amoung of missing records is: , 0)
Score       (The amoung of missing records is: , 0)
dtype: object
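
The count stays at zero because the dummy record's `Score` is an empty string, and `isnull()` only flags `NaN` or `None`. One way to make such a record show up, sketched here on a stand-in frame (`demo` is hypothetical), is to convert empty strings to `NaN` before checking:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"Sentence": ["Loved it.", "dummy_record"],
                     "Score": [1, ""]})

# an empty string is a valid value, so nothing is reported as missing
print(demo.isnull().sum())

# treat empty strings as missing before checking
cleaned = demo.replace("", np.nan)
print(cleaned.isnull().sum())
```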