# Lab 3
## More EDA: improving expertise in loading, cleaning, and analyzing data

The objective of Lab 3 is for you to become more proficient in obtaining and working with different types of data. A particular emphasis will be on dealing with text data.

This lab assignment will have 3 components. 

## Lab 3.A. Complete tutorials from Harvard's CS109 Lab 1

Go to https://github.com/cs109/2015lab1 and download the following files in your local Lab3 directory:
- https://github.com/cs109/2015lab1/blob/master/all.csv
- https://github.com/cs109/2015lab1/blob/master/hamlet.txt

We are going to go through the *Lab1-babypython.ipynb* and *Lab1-pythonpandas.ipynb*. The orginal Python notebooks were written in Python 2. We converted the notebooks into Python 3, which can be downloaded from here"

- https://github.com/CIS3715-temple-2019/CIS3715-temple-2019.github.io/blob/master/CIS3715-Lab3.A-babypython_py3.ipynb
- https://github.com/CIS3715-temple-2019/CIS3715-temple-2019.github.io/blob/master/CIS3715-Lab3.A-pythonpandas_py3.ipynb

Study all the code and run every block of code from the *babypython* tutorial. It covers many of the things you already learned in your Labs 1 and 2, so it is a good refresher. However, there are some new things. In particular, you will learn how to load a pure textual file and process it to find counts of all the unique words (also called the tokens) in the text.

Study all the code and run every block of code from the *pythonpandas* tutorial. Again, you will find there many things you already know. However, the novelty here is in processing and analysis of a slightly messy tabular data than was the case with the *Auto MPG data*.



**Deliverable**: submit the two .ipynb files after you have run all the lines of code. We will appreciate if we see that you put some extra effort, such as trying to modify existing code, enter new lines of code, or provide comments in the text. Make sure any modifications are easily visible by us for the grading purposes.

## Lab 3.B. Movie Lens Data

In this part of the lab, you will be working on an exercise that is a slightly modified and shortened version of https://github.com/cs109/2015/blob/master/Lectures/02-DataScrapingQuizzes.ipynb. In particular, you will learn how to load and analyze MoviLens data, which contains ratings of multiple movies by multiple users.

**The MovieLens data**

http://grouplens.org/datasets/movielens/

Take some time to learn about the data, because it will be helpful to do the assignment.


In [10]:
## all imports
from IPython.display import HTML
import numpy as np
import requests
import bs4 #this is beautiful soup
import time
import operator
import socket
import re # regular expressions

from pandas import Series
import pandas as pd
from pandas import DataFrame

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

In [4]:
# Read the user data:
#   pass in column names for each CSV
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']

users = pd.read_csv(
    'http://files.grouplens.org/datasets/movielens/ml-100k/u.user', 
    sep='|', names=u_cols, engine='python')

users.head(20)

Unnamed: 0,user_id,age,sex,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
5,6,42,M,executive,98101
6,7,57,M,administrator,91344
7,8,36,M,administrator,5201
8,9,29,M,student,1002
9,10,53,M,lawyer,90703


In [7]:
# Read the ratings:
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv(
    'http://files.grouplens.org/datasets/movielens/ml-100k/u.data', 
    sep='\t', names=r_cols, engine='python')

ratings.head(20) 

Unnamed: 0,user_id,movie_id,rating,unix_timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
5,298,474,4,884182806
6,115,265,2,881171488
7,253,465,5,891628467
8,305,451,3,886324817
9,6,86,3,883603013


In [204]:
# Read the movies data
#  the movies file contains columns indicating the movie's genres
#  let's only load the first five columns of the file with usecols
m_cols = ['movie_id', 'title', 'release_date', 
            'video_release_date', 'imdb_url']

movies = pd.read_csv(
    'http://files.grouplens.org/datasets/movielens/ml-100k/u.item', sep='|', encoding = "ISO-8859-1", names=m_cols, usecols=range(5), engine='python')

movies

Unnamed: 0,movie_id,title,release_date,video_release_date,imdb_url
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995)
5,6,Shanghai Triad (Yao a yao yao dao waipo qiao) ...,01-Jan-1995,,http://us.imdb.com/Title?Yao+a+yao+yao+dao+wai...
6,7,Twelve Monkeys (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Twelve%20Monk...
7,8,Babe (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Babe%20(1995)
8,9,Dead Man Walking (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Dead%20Man%20...
9,10,Richard III (1995),22-Jan-1996,,http://us.imdb.com/M/title-exact?Richard%20III...


Get information about the data:

In [83]:
movies['release_date'] = movies.release_date.astype('datetime64') # converted the dtype of release_date from object -> datetime64
print(movies.dtypes)
print()
print(movies.describe())
# *** Why only those two columns? ***

movie_id                       int64
title                         object
release_date          datetime64[ns]
video_release_date           float64
imdb_url                      object
dtype: object

          movie_id  video_release_date
count  1682.000000                 0.0
mean    841.500000                 NaN
std     485.695893                 NaN
min       1.000000                 NaN
25%     421.250000                 NaN
50%     841.500000                 NaN
75%    1261.750000                 NaN
max    1682.000000                 NaN


Selecting data:

* DataFrame => group of Series with shared index
* single DataFrame column => Series

In [113]:
users.head()
print("hello")
users['occupation'].head()
## *** Where did the nice design go? ***
columns_you_want = ['occupation', 'sex'] 
users[columns_you_want].head()


hello


Unnamed: 0,occupation,sex
0,technician,M
1,other,F
2,writer,M
3,technician,M
4,other,F


Filtering data:

Select users older than 25

In [None]:
oldUsers = users[users.age > 25]
oldUsers.head()

**Question 1**: 
* show users aged 40 and male
* show the mean age of female programmers

In [107]:
# users aged 40 AND male
oldUsers = users[(users.age == 40) & (users.sex == 'M')] # users aged 40 and male
print(oldUsers)

## users who are female and programmers
female = users[(users.sex == "F") & (users.occupation == "programmer")] 
## show statistic summary or compute mean
print("mean age of female programmers is", np.mean(female.age))  # mean age of female programmers


     user_id  age sex  occupation zip_code
18        19   40   M   librarian    02138
82        83   40   M       other    44133
115      116   40   M  healthcare    97232
199      200   40   M  programmer    93402
283      284   40   M   executive    92629
289      290   40   M    engineer    93550
308      309   40   M   scientist    70802
357      358   40   M    educator    10022
397      398   40   M       other    60008
564      565   40   M     student    55422
646      647   40   M    educator    45810
791      792   40   M  programmer    12205
841      842   40   M      writer    93055
917      918   40   M   scientist    70116
mean age of female programmers is 32.166666666666664


Find Diligent Users

- split data per user ID
- count ratings
- combine result



In [None]:
print(ratings.head())
## split data
grouped_data = ratings.groupby('user_id')
#grouped_data = ratings['movie_id'].groupby(ratings['user_id'])

## count and combine
ratings_per_user = grouped_data.count()

ratings_per_user.head(5)

**Question 2**:
* get the average rating per movie
* advanced: get the movie titles with the highest average rating

In [260]:
## split data
movie_id_and_rating = ['movie_id', 'rating']
a = ratings[movie_id_and_rating]
sorted_a = a.sort_values(by=['movie_id'])
arr = np.array([])
num_of_ratings_per_movie = []

## average and combine
for i in range(1,1683):
    mean = np.mean(sorted_a[sorted_a.movie_id == i].rating)
    num_of_ratings_per_movie.append(np.sum(sorted_a[sorted_a.movie_id == i].rating))
    arr = np.append(arr, mean)

# displays the movie title and the average rating alongside of it
for i,v in enumerate(arr,1):
    name_of_movie = movies[movies.movie_id == i]
    print(movies[movies.movie_id == i].title, "-- AVERAGE RATING OF MOVIE = ", v)

# get the maximum rating
max_rate = np.amax(arr)
print("MAXIMUM RATING IS ", max_rate)
# your code here

# get movie ids with that rating (anything higher than 4.8)
high_rated = np.where(arr > 4.8)
high_rated = high_rated[0]
l = np.ndarray.tolist(high_rated)
print()
print()
for i in l:
    print(movies[movies.movie_id == i].title, "****** HAS A HIGH RATING *******")

print()
print()
print("Good movie ids:")
for i in l:
    print(i, "****** GOOD MOVIE IDS *******")

print()
print()

print("Best movie titles")
for i in l:
    print(movies[movies.movie_id == i].title, "****** BEST MOVIE TITLES *******")


# get number of ratings per movie
print("Number of ratings per movie")
num = enumerate(num_of_ratings_per_movie, 1)
for i,v in num:
    print("movie_id", i, "number of ratings", v)


0    Toy Story (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.8783185840707963
1    GoldenEye (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.2061068702290076
2    Four Rooms (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.033333333333333
3    Get Shorty (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.550239234449761
4    Copycat (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.302325581395349
5    Shanghai Triad (Yao a yao yao dao waipo qiao) ...
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.576923076923077
6    Twelve Monkeys (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.798469387755102
7    Babe (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.9954337899543377
8    Dead Man Walking (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.8963210702341136
9    Richard III (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.

Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.2150170648464163
118    Maya Lin: A Strong Clear Vision (1994)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  4.5
119    Striptease (1996)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.2388059701492535
120    Independence Day (ID4) (1996)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.438228438228438
121    Cable Guy, The (1996)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.339622641509434
122    Frighteners, The (1996)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.234782608695652
123    Lone Star (1996)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  4.053475935828877
124    Phenomenon (1996)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.557377049180328
125    Spitfire Grill, The (1996)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.804123711340206
126    Godfather, The (1972)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  

Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.775
234    Mars Attacks! (1996)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.847926267281106
235    Citizen Ruth (1996)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.6222222222222222
236    Jerry Maguire (1996)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.7109375
237    Raising Arizona (1987)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.875
238    Sneakers (1992)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.513333333333333
239    Beavis and Butt-head Do America (1996)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.7884615384615383
240    Last of the Mohicans, The (1992)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.546875
241    Kolya (1996)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.9914529914529915
242    Jungle2Jungle (1997)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.4393939393939394
243    Smilla's S

Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.6545454545454548
344    Deconstructing Harry (1997)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.4
345    Jackie Brown (1997)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.642857142857143
346    Wag the Dog (1997)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.510948905109489
347    Desperate Measures (1998)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.2962962962962963
348    Hard Rain (1998)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.903225806451613
349    Fallen (1998)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.1463414634146343
350    Prophecy II, The (1998)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.7
351    Spice World (1997)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.1153846153846154
352    Deep Rising (1998)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.4285714285714284
353    Wedding Singer, T

Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.75
459    Crossing Guard, The (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.0
460    Smoke (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.6216216216216215
461    Like Water For Chocolate (Como agua para choco...
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.804054054054054
462    Secret of Roan Inish, The (1994)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.859154929577465
463    Vanya on 42nd Street (1994)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.5925925925925926
464    Jungle Book, The (1994)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.5647058823529414
465    Red Rock West (1992)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.5961538461538463
466    Bronx Tale, A (1993)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.7916666666666665
467    Rudy (1993)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =

561    Quick and the Dead, The (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.2083333333333335
562    Stephen King's The Langoliers (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.413793103448276
563    Tales from the Hood (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.037037037037037
564    Village of the Damned (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.6818181818181817
565    Clear and Present Danger (1994)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.569832402234637
566    Wes Craven's New Nightmare (1994)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.8
567    Speed (1994)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.6478260869565218
568    Wolf (1994)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.701492537313433
569    Wyatt Earp (1994)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.1
570    Another Stakeout (1993)
Name: title, dtype: obj

Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.0
677    Volcano (1997)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.808219178082192
678    Conan the Barbarian (1981)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.046728971962617
679    Kull the Conqueror (1997)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.5588235294117645
680    Wishmaster (1997)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.4444444444444446
681    I Know What You Did Last Summer (1997)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.06
682    Rocket Man (1997)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.4081632653061225
683    In the Line of Fire (1993)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.7928994082840237
684    Executive Decision (1996)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.356687898089172
685    Perfect World, A (1993)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.68
686

Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.0833333333333335
796    Timecop (1994)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.096774193548387
797    Bad Company (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.2222222222222223
798    Boys Life (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.2
799    In the Mouth of Madness (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.8846153846153846
800    Air Up There, The (1994)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.8125
801    Hard Target (1993)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.7
802    Heaven & Earth (1993)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.5555555555555554
803    Jimmy Hollywood (1994)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.875
804    Manhattan Murder Mystery (1993)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.6296296296296298
805    Menace II Society (1993

909    Nil By Mouth (1997)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.0
910    Twilight (1998)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.5
911    U.S. Marshalls (1998)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.0
912    Love and Death on Long Island (1997)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.5
913    Wild Things (1998)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.727272727272727
914    Primary Colors (1998)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.923076923076923
915    Lost in Space (1998)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.1666666666666665
916    Mercury Rising (1998)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.4285714285714284
917    City of Angels (1998)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.0
918    City of Lost Children, The (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.7916666666666665
919    Two B

Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.8857142857142857
1020    8 1/2 (1963)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.8157894736842106
1021    Fast, Cheap & Out of Control (1997)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.4375
1022    Fathers' Day (1997)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.5806451612903225
1023    Mrs. Dalloway (1997)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.6
1024    Fire Down Below (1997)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.9318181818181817
1025    Lay of the Land, The (1997)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.5
1026    Shooter, The (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.6666666666666665
1027    Grumpier Old Men (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.0405405405405403
1028    Jury Duty (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.0
1029    Beverly Hillbil

1141    When We Were Kings (1996)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  4.045454545454546
1142    Hard Eight (1996)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  4.0
1143    Quiet Room, The (1996)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.6666666666666665
1144    Blue Chips (1994)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.6666666666666665
1145    Calendar Girl (1993)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.0
1146    My Family (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.6666666666666665
1147    Tom & Viv (1994)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.3333333333333335
1148    Walkabout (1971)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.8461538461538463
1149    Last Dance (1996)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.111111111111111
1150    Original Gangstas (1996)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.142857142

1251    Contempt (Mépris, Le) (1963)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.0
1252    Tie That Binds, The (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.4285714285714284
1253    Gone Fishin' (1997)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  1.8181818181818181
1254    Broken English (1996)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.25
1255    Designated Mourner, The (1997)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.0
1256    Designated Mourner, The (1997)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  1.75
1257    Trial and Error (1997)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.5217391304347827
1258    Pie in the Sky (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.0
1259    Total Eclipse (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.75
1260    Run of the Country, The (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.25
1261

Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.6
1375    Meet Wally Sparks (1997)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.142857142857143
1376    Hotel de Love (1996)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.25
1377    Rhyme & Reason (1997)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.2
1378    Love and Other Catastrophes (1996)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.857142857142857
1379    Hollow Reed (1996)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.3333333333333335
1380    Losing Chase (1996)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.25
1381    Bonheur, Le (1965)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.0
1382    Second Jungle Book: Mowgli & Baloo, The (1997)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.0
1383    Squeeze (1996)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  1.6666666666666667
1384    Roseanna's Grave (For Rosea

Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  4.0
1498    Grosse Fatigue (1994)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.25
1499    Santa with Muscles (1996)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  5.0
1500    Prisoner of the Mountains (Kavkazsky Plennik) ...
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.0
1501    Naked in New York (1994)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  1.5
1502    Gold Diggers: The Secret of Bear Mountain (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.1
1503    Bewegte Mann, Der (1994)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.6666666666666665
1504    Killer: A Journal of Murder (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  4.0
1505    Nelly & Monsieur Arnaud (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.6666666666666665
1506    Three Lives and Only One Death (1996)
Name: title, dtype: object -- AVERAGE RATING OF

1621    Paris, France (1993)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.3333333333333335
1622    Cérémonie, La (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  4.0
1623    Hush (1998)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  2.0
1624    Nightwatch (1997)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  4.0
1625    Nobody Loves Me (Keiner liebt mich) (1994)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  1.0
1626    Wife, The (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.0
1627    Lamerica (1994)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.75
1628    Nico Icon (1995)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  4.0
1629    Silence of the Palace, The (Saimt el Qusur) (1...
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.0
1630    Slingshot, The (1993)
Name: title, dtype: object -- AVERAGE RATING OF MOVIE =  3.5
1631    Land and Freedom (Tierra y libertad) (1995)
Name:

movie_id 328 number of ratings 1010
movie_id 329 number of ratings 150
movie_id 330 number of ratings 124
movie_id 331 number of ratings 400
movie_id 332 number of ratings 495
movie_id 333 number of ratings 902
movie_id 334 number of ratings 215
movie_id 335 number of ratings 58
movie_id 336 number of ratings 113
movie_id 337 number of ratings 54
movie_id 338 number of ratings 242
movie_id 339 number of ratings 153
movie_id 340 number of ratings 678
movie_id 341 number of ratings 34
movie_id 342 number of ratings 152
movie_id 343 number of ratings 384
movie_id 344 number of ratings 201
movie_id 345 number of ratings 221
movie_id 346 number of ratings 459
movie_id 347 number of ratings 481
movie_id 348 number of ratings 89
movie_id 349 number of ratings 90
movie_id 350 number of ratings 129
movie_id 351 number of ratings 54
movie_id 352 number of ratings 55
movie_id 353 number of ratings 34
movie_id 354 number of ratings 251
movie_id 355 number of ratings 120
movie_id 356 number of rati

movie_id 953 number of ratings 73
movie_id 954 number of ratings 39
movie_id 955 number of ratings 167
movie_id 956 number of ratings 166
movie_id 957 number of ratings 4
movie_id 958 number of ratings 53
movie_id 959 number of ratings 222
movie_id 960 number of ratings 81
movie_id 961 number of ratings 123
movie_id 962 number of ratings 72
movie_id 963 number of ratings 176
movie_id 964 number of ratings 30
movie_id 965 number of ratings 77
movie_id 966 number of ratings 109
movie_id 967 number of ratings 38
movie_id 968 number of ratings 67
movie_id 969 number of ratings 285
movie_id 970 number of ratings 26
movie_id 971 number of ratings 125
movie_id 972 number of ratings 89
movie_id 973 number of ratings 14
movie_id 974 number of ratings 97
movie_id 975 number of ratings 139
movie_id 976 number of ratings 22
movie_id 977 number of ratings 132
movie_id 978 number of ratings 68
movie_id 979 number of ratings 112
movie_id 980 number of ratings 70
movie_id 981 number of ratings 14
movi

movie_id 1578 number of ratings 4
movie_id 1579 number of ratings 1
movie_id 1580 number of ratings 1
movie_id 1581 number of ratings 1
movie_id 1582 number of ratings 1
movie_id 1583 number of ratings 1
movie_id 1584 number of ratings 1
movie_id 1585 number of ratings 5
movie_id 1586 number of ratings 1
movie_id 1587 number of ratings 1
movie_id 1588 number of ratings 4
movie_id 1589 number of ratings 12
movie_id 1590 number of ratings 4
movie_id 1591 number of ratings 19
movie_id 1592 number of ratings 18
movie_id 1593 number of ratings 4
movie_id 1594 number of ratings 9
movie_id 1595 number of ratings 2
movie_id 1596 number of ratings 2
movie_id 1597 number of ratings 15
movie_id 1598 number of ratings 15
movie_id 1599 number of ratings 5
movie_id 1600 number of ratings 15
movie_id 1601 number of ratings 1
movie_id 1602 number of ratings 10
movie_id 1603 number of ratings 3
movie_id 1604 number of ratings 4
movie_id 1605 number of ratings 12
movie_id 1606 number of ratings 2
movie_

**Question 3**:
* get the average rating per user
* list all occupations and if they are male or female dominant

In [295]:
# get the average rating per user
user_id_and_rating = ['user_id', 'rating']
b = ratings[user_id_and_rating]
sorted_user_id = b.sort_values(by=['user_id'])
user_means = np.array([])
for i in range(1,944):
    mean = np.mean(sorted_user_id[sorted_user_id.user_id == i].rating)
    user_means = np.append(user_means, mean)
for i, rat in enumerate(user_means, 1):
    print("user id = ", i," rating =", rat)


# list all occupations and if they are male or female dominant
occ = users["occupation"].unique()
f = 0
m = 0
print("Occupations = ", users["occupation"].unique()) # this prints all of the different types of occupations
for i in occ:
    m = users[(users.sex == "M") & (users.occupation == i)].user_id.count()
    f = users[(users.sex == "F") & (users.occupation == i)].user_id.count()
    if m > f: 
        print("males are more dominant in ", i)
    elif m < f:
        print("females are more dominant in ", i)
    else:
        print("equal domination of both genders")

print('number of male users: ')
print(sum(users["sex"] == "M"))

print('number of female users: ')
print(sum(users['sex'] == 'F'))

user id =  1  rating = 3.610294117647059
user id =  2  rating = 3.7096774193548385
user id =  3  rating = 2.7962962962962963
user id =  4  rating = 4.333333333333333
user id =  5  rating = 2.874285714285714
user id =  6  rating = 3.6350710900473935
user id =  7  rating = 3.965260545905707
user id =  8  rating = 3.7966101694915255
user id =  9  rating = 4.2727272727272725
user id =  10  rating = 4.206521739130435
user id =  11  rating = 3.4640883977900554
user id =  12  rating = 4.392156862745098
user id =  13  rating = 3.09748427672956
user id =  14  rating = 4.091836734693878
user id =  15  rating = 2.875
user id =  16  rating = 4.328571428571428
user id =  17  rating = 3.0357142857142856
user id =  18  rating = 3.88086642599278
user id =  19  rating = 3.55
user id =  20  rating = 3.1041666666666665
user id =  21  rating = 2.670391061452514
user id =  22  rating = 3.3515625
user id =  23  rating = 3.6357615894039736
user id =  24  rating = 4.323529411764706
user id =  25  rating = 4.0

user id =  543  rating = 3.532994923857868
user id =  544  rating = 2.806451612903226
user id =  545  rating = 3.506172839506173
user id =  546  rating = 3.8983050847457625
user id =  547  rating = 3.652173913043478
user id =  548  rating = 3.670967741935484
user id =  549  rating = 3.72
user id =  550  rating = 3.675
user id =  551  rating = 3.802395209580838
user id =  552  rating = 3.0952380952380953
user id =  553  rating = 4.17
user id =  554  rating = 3.5701754385964914
user id =  555  rating = 4.019230769230769
user id =  556  rating = 4.204545454545454
user id =  557  rating = 3.7547169811320753
user id =  558  rating = 4.2
user id =  559  rating = 3.5789473684210527
user id =  560  rating = 3.396039603960396
user id =  561  rating = 3.0252100840336134
user id =  562  rating = 3.5416666666666665
user id =  563  rating = 3.7666666666666666
user id =  564  rating = 3.411764705882353
user id =  565  rating = 4.542857142857143
user id =  566  rating = 3.442953020134228
user id =  5

females are more dominant in  librarian
females are more dominant in  homemaker
males are more dominant in  artist
males are more dominant in  engineer
males are more dominant in  marketing
males are more dominant in  none
females are more dominant in  healthcare
males are more dominant in  retired
males are more dominant in  salesman
males are more dominant in  doctor
number of male users: 
670
number of female users: 
273


**Question 4**:
- produce a 1-page document that uses a combination of text, tables, and figures that provide some interesting insights about the Movie Lens data. You should feel free to use outside sources to produce the report, as long as you acknowledge your sources. 

In [None]:
print("hello")

## Lab 3.C. HTML Data

In this part of the lab, you will be also be working on an exercise that is a slightly modified and shortened version of https://github.com/cs109/2015/blob/master/Lectures/02-DataScrapingQuizzes.ipynb. In particular, you will learn how to load and analyze html data.

HTML:
* HyperText Markup Language
* standard for creating webpages
* HTML tags 
    - have angle brackets
    - typically come in pairs


In [None]:
htmlString = """<!DOCTYPE html>
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <h2> Test </h2>
    <p>Hello world!</p>
  </body>
</html>"""

htmlOutput = HTML(htmlString)
htmlOutput

Useful Tags:

* heading
`<h1></h1> ... <h6></h6>`

* paragraph
`<p></p>` 

* line break
`<br>` 

* link with attribute

`<a href="http://www.example.com/">An example link</a>`

### Scraping with Python:

Example of a simple webpage: http://www.crummy.com/software/BeautifulSoup

Good news: 
    - some browsers help
    - look for: inspect element
    - need only basic html
    - try 'Ctrl-Shift I' in Chrome
    - try 'Command-Option I' in Safari
   
Different useful libraries:
    - urllib
    - beautifulsoup
    - pattern
    - soupy
    - LXML
    - ...
 
The following cell just defines a url as a string and then reads the data from that url using the `urllib` library. If you uncomment the print command you see that we got the whole HTML content of the page into the string variable source.

In [None]:
url = 'http://www.crummy.com/software/BeautifulSoup'
source = requests.get(url).text
print(source)

**Question 5**:

* Is the word 'Alice' mentioned on the beautiful soup homepage?
* How often does the word 'Soup' occur on the site?
    - hint: use `.count()`
* At what index occurs the substring 'alien video games' ?
    - hint: use `.find()`

In [None]:
## is 'Alice' in source?

## count occurences of 'Soup'

## find index of 'alien video games'

**Beautiful Soup**

* designed to make your life easier
* many good functions for parsing html code

Some examples:

In [None]:
## get bs4 object
soup = bs4.BeautifulSoup(source)
 
## compare the two print statements
#print soup
#print soup.prettify()

## show how to find all a tags
soup.findAll('a')

## ***Why does this not work? ***
#soup.findAll('Soup')

More examples:

In [None]:
## get attribute value from an element:
## find tag: this only returns the first occurrence, not all tags in the string
first_tag = soup.find('a')

## get attribute `href`
first_tag.get('href')

## get all links in the page
link_list = [l.get('href') for l in soup.findAll('a')]
link_list

In [None]:
## filter all external links
# create an empty list to collect the valid links
external_links = []

# write a loop to filter the links
# if it starts with 'http' we are happy
for l in link_list:
    if l[:4] == 'http':
        external_links.append(l)

# this throws an error! It says something about 'NoneType'

In [None]:
# lets investigate. Have a close look at the link_list:
link_list

# Seems that there are None elements!
# Let's verify
#print sum([l is None for l in link_list])

# So there are two elements in the list that are None!

In [None]:
# Let's filter those objects out in the for loop
external_links = []

# write a loop to filter the links
# if it is not None and starts with 'http' we are happy
for l in link_list:
    if l is not None and l[:4] == 'http':
        external_links.append(l)
        
external_links

*Note*: The above `if` condition works because of lazy evaluation in Python. The `and` statement becomes `False` if the first part is `False`, so there is no need to ever evaluate the second part. Thus a `None` entry in the list gets never asked about its first four characters. 

In [None]:
# and we can put this in a list comprehension as well, it almost reads like 
# a sentence.

[l for l in link_list if l is not None and l.startswith('http')]

Parsing the Tree:

In [None]:
# redifining `s` without any line breaks
s = """<!DOCTYPE html><html><head><title>This is a title</title></head><body><h3> Test </h3><p>Hello world!</p></body></html>"""
## get bs4 object
tree = bs4.BeautifulSoup(s)

## get html root node
root_node = tree.html

## get head from root using contents
head = root_node.contents[0]

## get body from root
body = root_node.contents[1]

## could directly access body
tree.body

**Question 6**:

* Find the `h3` tag by parsing the tree starting at `body`
* Create a list of all __Hall of Fame__ entries listed on the Beautiful Soup webpage
    - hint: it is the only unordered list in the page (tag `ul`)

In [None]:
## get h3 tag from body


## use ul as entry point


## get hall of fame list from entry point
## skip the first entry 

## reformat into a list containing strings
## it is ok to have a list of lists


`tmp` now is actually a list of lists containing the hall of fame entries. 
Here is some advanced Python on how to print really just one entry per list item.

The cool things about this are: 
* The use of `""` to just access the `join` function of strings.
* The `join` function itself
* that you can actually have two nested for loops in a list comprehension

In [None]:
test =  ["".join(str(a) for a in sublist) for sublist in tmp]
print('\n'.join(test))

**Question 7**:
- Explain in detail what is Python doing in the previous line

**Question 8**:
- Plot a histogram of the count of the 20 most common words in the html file
- Plot a histogram of the count of the 20 most common words in the visible part (what is displayed in the browser) of the html file

**Deliverable**: For Lab 3.B and 3.C submit a modified version of this .ipynb file that contains all the answers to the quesitons