# Module 1: Python Programming
## A. Welcome to Jupyter (a.k.a. IPython Notebooks)
Take a moment to get your bearings. Study the icons above.

There are two major types of cells:

1) Markdown cells - simple text. One can do html tags like <b>BOLD</b> or latex like $\beta$.

2) Code cells - cells where we can run code.

<b>Shortcuts</b>

1) <b>CTRL-M</b> then <b>H</b> to see help

2) <b>CTRL-M</b> then <b>S</b> to save notebook

3) <b>CTRL-ENTER</b> to Run Code but stay in the same cell

4) <b>SHIFT-ENTER</b> to Run Code and advance to the next cell

5) Running <b>%pylab inline</b> before everything else in the notebook imports matplotlib and numpy (making <b>plt</b> and <b>np</b> available). It also renders our graphics inside the notebook.

6) You can use <b>TAB</b> to see available functions. You can use <b>SHIFT-TAB</b> repeatedly for the documentation.

In [None]:
%pylab inline

## B. Variables and Data Types

Python uses five standard data types:

### Numbers

In [None]:
varNum = 123
pi = 3.14159

varNum is an integer (int), so it holds whole numbers only, while pi is a float, which keeps the digits after the decimal point.
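
To make the distinction concrete, here is a minimal sketch that inspects the types with the built-in `type()` function. The variable names `whole`, `ratio`, `mixed` and `truncated` are hypothetical and used only for this illustration.

```python
# Hypothetical names, used only for this illustration.
whole = 123        # int: no fractional part
ratio = 3.14159    # float: keeps the decimal places

print(type(whole))   # <class 'int'>
print(type(ratio))   # <class 'float'>

# Arithmetic that mixes an int and a float produces a float.
mixed = whole + ratio
print(type(mixed))   # <class 'float'>

# Converting a float to an int drops the fractional part (truncation, not rounding).
truncated = int(ratio)
print(truncated)     # 3
```

Note that `int()` truncates toward zero; use `round()` if rounding is what you want.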

### Strings

In [None]:
varString = "Hello World!"
varText = 'This is a String'
print(varString)
print("The length of varString is",len(varString))

Strings may be declared with single quotes (') or double quotes ("); triple quotes (""") also work and let a string span several lines. The styles are interchangeable, though some people prefer to stick to one.
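
As a small sketch of the point above, the three quoting styles produce equal strings. The variable names here are hypothetical.

```python
# The same text written with the three quoting styles.
single = 'Hello World!'
double = "Hello World!"
triple = """Hello World!"""
print(single == double == triple)   # True

# Picking one style lets you embed the other quote character without escaping.
with_apostrophe = "It's a string"
print(with_apostrophe)

# Triple quotes may span several lines.
multi = """line one
line two"""
print(len(multi.splitlines()))      # 2
```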

### Lists

In [None]:
varList = ["abc", 123]
print(varList)
print(len(varList))

In [None]:
print(varList[0])
print(len(varList[0]))

You can think of lists as similar to Java's ArrayLists: indexing starts at 0, and you retrieve an element by putting its index in square brackets. You may also append items to a list and remove them.

### Tuples

In [None]:
varTuple = ('abc', 123, "HELLO")
print(varTuple)
print(len(varTuple))
print(varTuple[0])

It may seem that tuples differ from lists only in using parentheses instead of brackets, but there are real differences. Tuples are immutable: once created, their elements cannot be appended, removed or reassigned. Lists also offer many more built-in methods than tuples.
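
The following sketch illustrates that difference. `sample_list` and `sample_tuple` are hypothetical names chosen so the notebook's own variables are not overwritten.

```python
sample_list = ["abc", 123]
sample_tuple = ("abc", 123)

# A list can be changed in place.
sample_list[0] = "xyz"
print(sample_list)                      # ['xyz', 123]

# A tuple cannot: item assignment raises a TypeError.
try:
    sample_tuple[0] = "xyz"
    tuple_changed = True
except TypeError as err:
    tuple_changed = False
    print("TypeError:", err)

# Lists have an append method, tuples do not.
print(hasattr(sample_list, "append"))   # True
print(hasattr(sample_tuple, "append"))  # False
```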

<b>HINT:</b> You can try to type <b><i>varList.</i></b> in one line as well as <b><i>varTuple.</i></b> and press <b>TAB</b> after the period (.) in order to view possible functions you can call from that variable. You may also try to press <b>CTRL + TAB</b> when the text cursor.

In [None]:
varList.append("HELLO")
print(varList)
print(len(varList))

However, tuples use less memory than lists, which can make them slightly faster to work with. As a rule, use a tuple when the contents are fixed and a list when you need to keep changing its size or elements.

In [None]:
print(varList)
print(varList.__sizeof__())
print(varTuple)
print(varTuple.__sizeof__())

### Dictionaries

In [None]:
var = 3
varDict = {'first':1, '2':'2nd', 3:var}
varDict

You may also add entries to a dictionary one at a time:

In [None]:
varDict = {}
varDict['first'] = 1
varDict['2'] = '2nd'
varDict[3] = var
print(varDict[3])

If you have used JavaScript Object Notation (JSON), Python dictionaries will look familiar. You look up a value by putting its key in square brackets.
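
Here is a minimal sketch of key lookup and of the JSON connection, using a hypothetical dictionary named `record`. One difference worth noticing: JSON keys are always strings, while Python keys can be other hashable types such as integers.

```python
import json

# Hypothetical dictionary mirroring the example above.
record = {"first": 1, "2": "2nd", 3: 3}

print(record["first"])        # 1
print(record["2"])            # 2nd

# The string '2' and the integer 2 are different keys.
print("2" in record)          # True
print(2 in record)            # False

# get() returns a fallback instead of raising KeyError for a missing key.
print(record.get("missing", "n/a"))   # n/a

# Parsing JSON text yields an ordinary dictionary.
parsed = json.loads('{"first": 1, "2": "2nd"}')
print(parsed["first"])        # 1
```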

## Arithmetic

Python provides the basic arithmetic operators found in most, if not all, programming languages.

### Addition

In [None]:
a = 5 + 3
a

### Subtraction

In [None]:
a = 5 - 3
a

### Multiplication

In [None]:
a = 5 * 3
a

### Exponent

In [None]:
a = 5 ** 3
a

### Division

In [None]:
a = 5 / 3
a

### Modulus Division

In [None]:
a = 5 % 3
a

### Integer Division

In [None]:
a = 5 // 3
a

### Increment

In [None]:
a = 5
a += 1
a

### Decrement

In [None]:
a = 5
a -= 1
a

<b>NOTE:</b> Python does not support the increment/decrement syntax <b>x++/x--</b>. Use <b>x+=1/x-=1</b> instead, which is equivalent to <b>x=x+1/x=x-1</b>.

### String Concatenation

In [None]:
a = 'Hello ' + 'World!'
a

Strings may also be joined with the plus <b>(+)</b> symbol.

### Complex Expressions

In [None]:
a = 3 + 5 - 6 * 2 / 4
a

## Challenge! Write the following as code

$$ g(z) = \frac{1}{1+e^{-z}}  $$

1) z = -8 and e = 2.718 should give approximately 0.0003

2) z = -2 and e = 2.718 should give 0.1192246961081721

<b><i>TRIVIA</i></b>: The value <b>e</b>, also called <b>Euler's number</b>, is a mathematical constant approximately equal to <b>2.71828</b>. It is irrational, meaning that <b>e</b> is a real number whose decimal expansion never ends or repeats and that cannot be written exactly as a fraction, just like <b>pi</b>.

## C. Control Statements and Data Structures
## Conditional Statements
In Python, curly brackets are not used to mark the commands that belong to a conditional statement; consistent indentation is used instead. The indentation must be made of the same characters throughout a block: four spaces may look the same as a tab, but mixing the two produces an error.
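
The sketch below shows the rule in action. It keeps two tiny programs as strings (hypothetical names `consistent` and `mixed`) and hands them to the built-in `compile()` function, so the badly indented one can be tested without breaking the notebook cell itself.

```python
# Two tiny programs held as strings so they can be compiled safely.
consistent = "if True:\n    a = 1\n    b = 2\n"   # four spaces on both lines
mixed = "if True:\n    a = 1\n\tb = 2\n"          # four spaces, then a tab

# Consistent indentation compiles without complaint.
compile(consistent, "<consistent>", "exec")

# Mixing spaces and a tab in the same block is rejected.
try:
    compile(mixed, "<mixed>", "exec")
    mixing_rejected = False
except IndentationError as err:   # TabError is a subclass of IndentationError
    mixing_rejected = True
    print(type(err).__name__, "-", err.msg)
```

Most editors, Jupyter included, insert spaces when you press TAB, which avoids the problem in practice.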
### Boolean Condition

In [None]:
x = True
if x:
    print("var x is True")
else:
    print("var x is False")

### String Condition

In [None]:
x = "Hello World!"

if x == 'Hello World!':
    print("var x is Hello World!")
else:
    print("var x is not Hello World!")

### Numerical Condition

In [None]:
x = 10

if x == '10':
    print("var x is a String")
elif x == 10:
    print("var x is an Integer")
else:
    print("var x is none of the above")

### Multiple Conditions

In [None]:
x = 10

if x > 5 and x < 15 and x == 10:
    print("var x is really 10!")
else:
    print("var x is not really 10")

In [None]:
x = 10

if x == 10 or x == 20:
    print("var x can be 10 or 20")
else:
    print("var x is not 10 nor 20")

## Loops
As with conditional statements, the commands within a loop are marked by consistent indentation.
### For Loops

In [None]:
for var in range(0,5,2):
    print(var)

<b>NOTE:</b> The command <b>range(0,5,2)</b> yields the numbers starting at 0, incremented by 2, stopping before 5: 0, 2, 4.

In [None]:
[v for v in range(1, 100, 5)]

<b>NOTE:</b> range([start], stop, [step]): start defaults to 0, step defaults to 1, and stop is never included.
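
A few calls side by side may help. This is only a sketch; wrapping `range` in `list()` is done here just to display the values.

```python
# stop only: start defaults to 0, step defaults to 1
print(list(range(5)))          # [0, 1, 2, 3, 4]

# start and stop: stop is excluded
print(list(range(1, 5)))       # [1, 2, 3, 4]

# start, stop and step
print(list(range(0, 5, 2)))    # [0, 2, 4]

# a negative step counts downward
print(list(range(5, 0, -2)))   # [5, 3, 1]
```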

In [None]:
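aggregator = []

# range(5) stands in for gen_num_up_to(5), which is only defined later,
# in the List Generators and Comprehension section
for i in range(1, 5):
    aggregator.append([v for v in range(5)])
aggregator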

### While Loops

In [None]:
var = 0
while var < 5:
    print(var)
    var += 2

### Nested Loops

In [None]:
x = 0
while x < 5:
    for y in range(0, x):
        print(y, end='')
    x+=1
    print()

Always remember the colon <b>(:)</b> at the end of the line that declares a loop or condition.

## Lists

In [None]:
pi = 3.14159
varList = [1, 2, 'A', 'B', 'Hello!', pi]
print(varList[0])

Lists were introduced in section B, Variables and Data Types.
You can choose to store different data types in a single list.

In [None]:
print(varList[4])

You may call the content of the list through indexing

In [None]:
varList.append('World!')
print(varList[6])

You may also append items to the list. The example above shows that a newly appended item is added at the end of the list.

In [None]:
len(varList)

You may obtain the number of elements in a list by calling the <b>len()</b> function

In [None]:
print(varList[5])

In [None]:
varList.remove(pi)
print(varList[5])

Initially <b>varList[5]</b> returned 3.14159. When the <b>remove()</b> function was called, it iterated through the list looking for the first match and deleted that value, so <b>varList[5]</b> now holds the next element, 'World!'.
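
A minimal sketch of the "first match" behaviour, using a hypothetical list named `letters` that contains a duplicate:

```python
letters = ["a", "b", "a", "c"]

# Only the first 'a' is removed; the second one stays.
letters.remove("a")
print(letters)                 # ['b', 'a', 'c']

# Removing a value that is not in the list raises a ValueError.
try:
    letters.remove("z")
    missing_raised = False
except ValueError as err:
    missing_raised = True
    print("ValueError:", err)
```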

## Dictionaries

In [None]:
var = "Hello World!"
varDict = {'first' : 123, 2 : 'abc', '3' : var, 4:['lista', 'listb']}
print(varDict['first'])

In [None]:
print(varDict[2])

In [None]:
print(varDict['3'])

In [None]:
print(varDict[4])

In [None]:
print(varDict[4][1])

In [None]:
len(varDict)

## List Generators and Comprehension

In [None]:
def gen_num_up_to(n):
    num = 0
    while num < n:
        yield num
        num += 1

In [None]:
gen_num_up_to(5)

In [None]:
varList = gen_num_up_to(5)
print([var for var in varList])

In [None]:
def gen_num_up_to(n):
    num = 0
    while num < n:
        yield num
        num += 2

varList = gen_num_up_to(5)
print([var for var in varList])

In [None]:
varList = range(0, 5, 2)
print([var for var in varList])

## Slicing

In [None]:
varList = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
varList[:5]

In [None]:
varList[5:]

In [None]:
varList[:-2]

In [None]:
varList[-2:]

In [None]:
print(varList[2:-2])

In [None]:
varList[2:8:2]

<b>NOTE:</b> <i>list[[start] : [end] : [step]]</i>
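
The slice notation can be sketched with a few more cases. The list `numbers` is a hypothetical stand-in for `varList`.

```python
numbers = list(range(1, 11))   # [1, 2, ..., 10]

# start, end and step together: indices 2, 4 and 6
print(numbers[2:8:2])          # [3, 5, 7]

# every part is optional; this takes every third element
print(numbers[::3])            # [1, 4, 7, 10]

# negative indices count from the end
print(numbers[-3:])            # [8, 9, 10]

# a step of -1 walks the list backwards
print(numbers[::-1])           # [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
```

Slicing returns a new list, so `numbers` itself is left unchanged.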

## D. Functions

Functions use the following notation:

def <i>function_name</i>(<i>parameters</i>):<br>
<pre><i> input commands here </i></pre>

np.random.randint(a, b) outputs uniformly random integers from $[a, b)$ (with a single argument, from $[0, a)$); you will need it for the challenge below. First, here is a sample function:

In [None]:
def remainder(n, m):
    while True:
        if n - m < 0:
            return n
        else:
            n = n - m

In [None]:
remainder(10, 4)

## Challenge: Coin Flip
Create a function that simulates coin flips repeated n times. Make the function a generator with a parameter n for the number of coin flips. Use np.random.randint to simulate a coin flip.

After defining this function, output a list using a list comprehension similar to the above example.
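
Before attempting the challenge, it may help to see how `np.random.randint` treats its bounds. The sketch below simulates rolls of a six-sided die rather than coin flips; the names `rolls` and `small` are hypothetical.

```python
import numpy as np

# The upper bound is exclusive, so randint(1, 7) yields the integers 1 to 6.
rolls = np.random.randint(1, 7, size=1000)
print(rolls[:10])
print(rolls.min(), rolls.max())

# With a single argument the range starts at 0: randint(3) yields 0, 1 or 2.
small = np.random.randint(3, size=1000)
print(np.unique(small))
```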

# Vectors, Matrices and Computation
In Python, you can use <b>NumPy</b>, conventionally imported with <b>import numpy as np</b>, for functions that work on vectors and matrices.

In [None]:
np.array(range(100))

It's the same as just: 

In [None]:
np.arange(0, 100)

In [None]:
print(np.arange(0, 100))

In [None]:
np.array([v for v in coin_flip(10)])

## Vector computations
### Vector to scalar

In [None]:
varArray = np.arange(0, 5)
varArray * 2

### Vector to vector: Dot Product

In [None]:
varArrayA = np.arange(0,5)
varArrayB = np.arange(5,10)

print(varArrayA)
print(varArrayB)
print(np.dot(varArrayA, varArrayB))

### Vector to vector: Element-wise multiplication

In [None]:
varArrayA * varArrayB

## Challenge: Compute the mean of 1000 coin flips
Recall that mean is computed thus:

$$ mean(\vec{X}) = \frac{1}{|\vec{X}|} \sum_{k}X_k$$

## Matrix to Scalar: Element-wise multiplication

In [None]:
mat_a = np.random.randint(0, 5, size=(4,4))
print(mat_a)
print(mat_a * 2)

## Matrix to Matrix: Matrix Multiplication 
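
With NumPy arrays, `*` multiplies element by element, while the matrix product is written with `@` (or `np.dot`). A small sketch with a hypothetical 2x2 array `m` shows the difference:

```python
import numpy as np

m = np.array([[1, 2],
              [3, 4]])

# Element-wise product: each entry is multiplied by itself.
elementwise = m * m
print(elementwise)     # [[ 1  4]
                       #  [ 9 16]]

# Matrix product: rows of the left operand times columns of the right.
product = m @ m
print(product)         # [[ 7 10]
                       #  [15 22]]

# np.dot gives the same result as @ for 2-D arrays.
print(np.array_equal(product, np.dot(m, m)))   # True
```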

In [None]:
# @ (or np.dot) computes the matrix product; * would multiply element by element
mat_a @ mat_a

## Advanced Matrix Operations

In [None]:
from scipy.linalg import eig
eig(mat_a)

# Enter Pandas

In [None]:
import pandas as pd

data = pd.read_csv("movie_metadata.csv")

In [None]:
data.shape

In [None]:
data.columns

## Slicing data frames

In [None]:
data[:4]

## Indexing Columns

In [None]:
data.director_name[:4]

In [None]:
cols = ["movie_title","director_name"]
data[cols][:5]

## Indexing Rows

In [None]:
# .ix has been removed from pandas; .loc selects rows by index label (end inclusive)
data.loc[10:12]

## Find Movies by James Cameron

In [None]:
data[data.director_name == 'James Cameron']

## Sort films by gross earnings

In [None]:
sorted_data = data.sort_values(by="gross", ascending=False)
sorted_data[:5]

## Challenge! Get the top 5 films of Michael Bay

## Multiple Conditions: Find films from Canada that have Hugh Jackman as the actor_1_name

In [None]:
data[(data['country'] == 'Canada') & (data['actor_1_name'] == 'Hugh Jackman')]

## Challenge! Find the actor who is the actor_1_name for the movie that grossed exactly 67344392. Then, find films whose actor_3_name is Piolo Pascual and actor_1_name is the person from Armageddon.

# Preprocessing

In [None]:
data.dtypes

## Convert actor_1_facebook_likes to integers

In [None]:
# will have an error
#data.actor_1_facebook_likes.astype(np.int64)

In [None]:
data['actor_1_facebook_likes'] = data['actor_1_facebook_likes'].fillna(0)
data['actor_1_facebook_likes'] = data['actor_1_facebook_likes'].astype(np.int64)
data.dtypes

## Apply function for each cell
A lambda can be thought of as a nameless function.
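
To illustrate, the sketch below writes the same square-root operation as a named function and as a lambda. The names `root` and `values` are hypothetical, and `math.sqrt` is used instead of `np.sqrt` only to keep the example free of dependencies.

```python
import math

# A named function...
def root(x):
    return math.sqrt(x)

values = [0, 4, 9]

# ...and the same logic as a lambda passed directly to map().
named_result = [root(v) for v in values]
lambda_result = list(map(lambda x: math.sqrt(x), values))

print(named_result)    # [0.0, 2.0, 3.0]
print(lambda_result)   # [0.0, 2.0, 3.0]
```

A lambda is handy when the function is short and used only once, as in the `apply` call that follows.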

In [None]:
data.actor_1_facebook_likes.apply(lambda x: np.sqrt(x))[:5]

## Clean titles


In [None]:
data.movie_title.tolist()[:5]

In [None]:
# the final .decode('ascii') turns the bytes object back into a str
data.movie_title[:10].apply(
    lambda x: x.encode().decode('unicode_escape').encode('ascii','ignore').decode('ascii')).tolist()

In [None]:
# a weakness though is that it removes enye
print(data[data.movie_title.str.contains("ñ")]["movie_title"])
print("The following removed the enye:")
print(data[data.movie_title.str.contains("ñ")]["movie_title"].apply(
    lambda x: x.encode().decode('unicode_escape').encode('ascii','ignore').decode('ascii')).tolist())

In [None]:
# decode back to str so the column does not end up holding bytes objects
data["movie_title"] = data["movie_title"].apply(
    lambda x: x.encode().decode('unicode_escape').encode('ascii','ignore').decode('ascii'))

## Clean all

Replace missing values with 0. Of course, imputation could be done here, but other libraries are needed for it.

fillna returns a new DataFrame and leaves the original data unchanged, so cleaned_data can be modified without affecting data.
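
A quick check on a toy DataFrame shows what `fillna` returns and what happens to the original. The frame `toy` and its column names are hypothetical, chosen only for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing values.
toy = pd.DataFrame({"likes": [10.0, np.nan, 30.0],
                    "budget": [np.nan, 5.0, 7.0]})

filled = toy.fillna(0)

# The original still has its missing values...
missing_before = int(toy.isna().sum().sum())
# ...while the returned frame has none.
missing_after = int(filled.isna().sum().sum())

print(missing_before, missing_after)
print(filled is toy)   # False: a separate object is returned
```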

In [None]:
cleaned_data = data.fillna(0)

## Data Summaries

In [None]:
cleaned_data.describe()

## Data Correlations

In [None]:
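# correlations are only defined for numeric columns, so select those first
cleaned_data.select_dtypes(include=[np.number]).corr()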

## Outliers : Clipping to the 99th percentile
This is just one of the many rules-of-thumb used in practice. It doesn't always work, especially if one has too many outliers.
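
As a sketch of the rule on toy data (the array `durations` is hypothetical), the cap can be computed with `np.quantile` and passed straight to `np.clip`:

```python
import numpy as np

# Hypothetical durations in minutes, with one extreme value.
durations = np.array([90, 95, 100, 105, 110, 120, 600])

# 99th percentile of the data, used as the upper bound.
cap = np.quantile(durations, 0.99)

# Values above the cap are pulled down to it; the rest are untouched.
clipped = np.clip(durations, 0, cap)

print(cap)
print(clipped)
```

Passing the computed quantile rather than typing its value by hand keeps the cell valid if the data changes.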

In [None]:
cleaned_data.duration.quantile(0.99)

In [None]:
cleaned_data['duration'] = np.clip(cleaned_data['duration'], 0, 189)
cleaned_data.describe()

## Output to CSV file

In [None]:
cleaned_data.to_csv('movie_metadata_cleaned.csv')

In [None]:
!head -n 2 movie_metadata_cleaned.csv
#!more movie_metadata_cleaned.csv P 2

# Aggregations
In this example, the group by statement does two things:

1) groupby() groups the rows of the dataframe by title_year

2) size() counts the rows in each group and returns a Series that has title_year as the index
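
The two steps can be sketched on a toy frame. `toy` is hypothetical; it borrows the column names of the movie data but holds made-up numbers.

```python
import pandas as pd

toy = pd.DataFrame({
    "title_year": [2014, 2015, 2015, 2016, 2016, 2016],
    "movie_facebook_likes": [10, 20, 40, 0, 30, 60],
})

# size() counts the rows per group; the group keys become the index.
per_year = toy.groupby("title_year").size()
print(per_year.index.tolist())   # [2014, 2015, 2016]
print(per_year.tolist())         # [1, 2, 3]

# Selecting a column before aggregating applies the function to that column only.
mean_likes = toy.groupby("title_year")["movie_facebook_likes"].mean()
print(mean_likes.loc[2015])      # 30.0
```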

In [None]:
cleaned_data["title_year"] = cleaned_data["title_year"].astype(np.int64)
movies_per_year = cleaned_data.groupby("title_year").size()
movies_per_year[-5:]

In [None]:
like_per_year = cleaned_data.groupby("title_year")["movie_facebook_likes"].mean()
like_per_year[-5:]

# Data Visualization
## Matplotlib: Line Plot : Average Facebook Likes per Year
plt.plot takes the x-axis values as its first parameter and the y-axis values as its second.

plt.figure(figsize=(15,8)) sets the size of the figure (width and height, in inches).

In [None]:
# we're importing seaborn early for a better look-and-feel of the plots
import seaborn as sns
# since seaborn 0.8 the styling is only applied after calling sns.set()
sns.set()

In [None]:
fig = plt.figure(figsize=(15,8))
years = like_per_year.index.tolist()[1:]
likes = like_per_year[1:]
plt.plot(years, likes)
plt.show()

## Matplotlib: Scatterplot : Gross VS Budget

In [None]:
fig = plt.figure(figsize=(15,8))
plt.scatter(cleaned_data["gross"], cleaned_data["budget"])
plt.show()

## Matplotlib: Histogram of IMDB scores

In [None]:
fig = plt.figure(figsize=(15,8))
plt.hist(cleaned_data["imdb_score"], bins=20)
plt.show()

## Pandas : Barplot of the gross earnings of the 10 movies with highest budget superimposed with their budget as a line graph

In [None]:
top_budget = cleaned_data.sort_values(by="budget", ascending=False)["movie_title"][:10].tolist()
top_budget_data = cleaned_data[cleaned_data["movie_title"].isin(top_budget)]

ax1 = top_budget_data.plot(x="movie_title", y="gross", kind="bar", figsize=(15,8))
ax2 = ax1.twinx()
top_budget_data.plot(x="movie_title", y="budget", kind="line", 
                     color='red', figsize=(15,8), ax=ax2)
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')
plt.show()

## Plot the histograms of imdb_scores according to different content rating Types  
We're finding out which content type tends to have the higher imdb_score.

In [None]:
import seaborn as sns

ax = sns.distplot(cleaned_data[cleaned_data['content_rating'] == 'PG-13']["imdb_score"], 
                  color='red')
sns.distplot(cleaned_data[cleaned_data['content_rating'] == 'R']["imdb_score"], 
             color='teal', ax=ax)
sns.distplot(cleaned_data[cleaned_data['content_rating'] == 'GP']["imdb_score"], 
             color='blue', ax=ax)

In [None]:
from pandas.plotting import scatter_matrix  # pandas.tools.plotting was removed
cols = ["num_critic_for_reviews", "imdb_score", "movie_facebook_likes"]
scatter_matrix(cleaned_data[cols], alpha=0.2, figsize=(20, 20), 
               diagonal='kde', marker='o')
plt.show()

In [None]:
from pandas.plotting import scatter_matrix  # pandas.tools.plotting was removed
cols = ["num_critic_for_reviews", "duration", "facenumber_in_poster", 
        "num_user_for_reviews", "budget", "imdb_score", "movie_facebook_likes", "gross"]
scatter_matrix(cleaned_data[cols], alpha=0.2, figsize=(20, 20), 
               diagonal='kde', marker='o')
plt.show()

## Seaborn : Conditional Formatting
For more, check out: https://pandas.pydata.org/pandas-docs/stable/style.html

In [None]:
import seaborn as sns

cm = sns.light_palette("green", as_cmap=True)
cols = ["movie_title", "imdb_score","gross"]
color_me = cleaned_data[cols][:10]
s = color_me.style.background_gradient(cmap=cm)
s

## Advanced Graphs: Plotting more than 2 variables

Let's take this slowly.

In [None]:
plt.figure(figsize=(15,8))
# min-max scale gross to [0, 1], then multiply by 1000 to get marker sizes
sizes = 1000*((cleaned_data["gross"] - min(cleaned_data["gross"])) 
              / (max(cleaned_data["gross"]) - min(cleaned_data["gross"])))
colors = np.where(cleaned_data.genres.str.contains("Fantasy"), 'red', 'green')
plt.scatter(x=cleaned_data["movie_facebook_likes"], 
            y=cleaned_data["imdb_score"], c=colors,
            s = sizes)

plt.xlabel("Facebook Likes")
plt.ylabel("IMDB Score")
plt.xlim((0,100000))
plt.show()

## Seaborn : Line Plot with Regression : Voted users and reviews

In [None]:
sns.lmplot(x="num_voted_users", 
           y="num_user_for_reviews",
           data=cleaned_data, height=8)  # 'size' was renamed to 'height' in seaborn 0.9

In [None]:
sns.jointplot(x="num_voted_users", y="num_user_for_reviews", data=cleaned_data, height=8,
             kind="hex", color="#4CB391")

## Challenge! Fix the above plot by clipping with Tukey's fences (k=1):
Use the np.clip function to bound the values to the following interval:
$$
{\big [}Q_{1}-k(Q_{3}-Q_{1}),Q_{3}+k(Q_{3}-Q_{1}){\big ]}
$$
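
As a hint, here is a sketch of the interval on a toy array rather than on the movie data. The names `values`, `lower` and `upper` are hypothetical, and `np.percentile` is one of several ways to obtain the quartiles.

```python
import numpy as np

# Hypothetical sample with one obvious outlier.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])
k = 1

# First and third quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Tukey's fences: [Q1 - k*IQR, Q3 + k*IQR]
lower = q1 - k * iqr
upper = q3 + k * iqr
print(lower, upper)            # -1.0 11.0

# Values outside the fences are pulled back to the nearest bound.
clipped = np.clip(values, lower, upper)
print(clipped)
```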

## Seaborn : Barplot of the gross earnings of the last 10 years

In [None]:
cleaned_data["title_year"] = cleaned_data["title_year"].astype(np.int64)
latest_10_years = np.sort(cleaned_data["title_year"].unique())[-10:]

latest_movies_data=cleaned_data[cleaned_data["title_year"].isin(latest_10_years)]
plt.figure(figsize=(15,4))
sns.barplot(x="title_year", y="gross", data=latest_movies_data)
plt.xticks(rotation=45)
plt.show()

## Challenge : Get the top 10 directors with most movies directed and use a boxplot for their gross earnings

# Afternoon Activities

## Plot the following variables in one graph:

- num_critic_for_reviews
- IMDB score
- gross
- Steven Spielberg against others

## Compute Sales (Gross - Budget), add it as another column

## Which directors garnered the most total sales?

## Which actors garnered the most total sales?

## Plot sales and average likes as a scatterplot. Fit it with a line.

## Which of these genres are the most profitable? Plot their sales using different histograms, superimposed in the same axis.

- Romance
- Comedy
- Action
- Fantasy

## Standardization
Standardize sales using the following formula then save it to a new column.

$$
Z={X-\operatorname {E} [X] \over \sigma (X)}
$$

The first values should be: [2.612646, 0.026695, -0.246587, 0.975996, -0.020609]
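
The formula can be sketched on a toy list using only the standard library. The list `sales` is hypothetical. The choice of `statistics.pstdev` (population standard deviation) is an assumption: the text does not say whether the expected values above were produced with the population or the sample version, and pandas' `std()` defaults to the sample version.

```python
import statistics

# Hypothetical sales figures.
sales = [10.0, 20.0, 30.0, 40.0, 50.0]

mean = statistics.mean(sales)       # E[X]
sigma = statistics.pstdev(sales)    # sigma(X), population version (assumption)

# Z = (X - E[X]) / sigma(X), applied element by element.
z_scores = [(x - mean) / sigma for x in sales]
print(z_scores)

# After standardization the values have mean 0 and standard deviation 1.
print(statistics.mean(z_scores), statistics.pstdev(z_scores))
```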

## For each movie, compute the average likes of the three actors and store it as a new variable. Standardize. Read up on the mean function.

Store it as a new column, average_actor_likes.

## Create a linear hypothesis function

Create a function that takes (1) a scalar, (2) theta and (3) a bias variable to output a value as close as possible to gross.

$$
score = bias + \sum_j{(\theta_j * x_j)}
$$

$$
score = \theta_1 * average\_actor\_likes + bias
$$

## Create an RMSE function
Create a function that compares two vectors and outputs the root mean squared error / deviation.

$$
\operatorname{RMSD}(\hat{\theta}) = \sqrt{\operatorname{MSE}(\hat{\theta})} = \sqrt{\operatorname{E}((\hat{\theta}-\theta)^2)}
$$

## Create the best possible thetas by brute-forcing against the RMSE function.

Create predictions for your entire dataset. Compare your predictions against the score. Achieve the smallest RMSE you can.

## Plot your best theta, bias variable against the imdb score for each movie

For a cleaner plot:

(1) compile your average_actor_likes, imdb_scores and predicted to a new dataframe

(2) limit the bounds of your predicted ratings

## Convert your hypothesis function to use more variables:

Don't forget to standardize your new variables.

$$
score = \theta_1 * average\_actor\_likes + \theta_2 * movie\_facebook\_likes + \theta_3 * sales + bias
$$

## Compile your theta values to a new pandas dataframe which consists of the following columns:

<table>
<tr>
<th> $\theta_1$ </th>
<th> $\theta_2$ </th>
<th> $\theta_3$ </th>
<th> $RMSE$ </th>
</tr>
<tr>
<td>0.1</td><td>0.1</td><td>0.1</td><td>10000</td>
</tr>
<tr>
<td>0.2</td><td>0.2</td><td>0.2</td><td>2000</td>
</tr>
</table>


## Plot how each theta parameter influences the RMSE. Which one seems to be most influential?

# Advanced Activities

## Using Linear Regression (Ridge)
Find the best coefficients using Ridge regression