# Introduction to Python

## Variables

_Variables_ are how data is stored and accessed in Python. A variable name can be almost anything (letters, digits, and underscores, as long as it does not start with a digit), and a variable can store many different kinds of data.

In [1]:
x = 2

In [2]:
y = 5

In [3]:
x

2

In [4]:
y

5

In [5]:
x + y

7

In [6]:
xy = 'Hey'

In [7]:
xy

'Hey'
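As a small sketch of how assignment behaves (the name `total` and the new values are illustrative, not part of the cells above): a variable keeps its value until you assign to it again, and anything computed from it earlier is not updated automatically.

```python
x = 2
y = 5
total = x + y   # computed once, from the current values of x and y
print(total)

x = 10          # reassigning x does not change total
print(total)

x = 'ten'       # a variable can be reassigned to hold a different kind of data
print(x)
```

If you want `total` to reflect the new `x`, you have to run the line that computes it again.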

Variables are stored in _memory_, which is not persistent. That means that every time the "kernel" of the Python process shuts down, it forgets all variables. The kernel can shut down for a variety of reasons. For example, if you shut down your JupyterHub server, the kernel shuts down. If you run out of memory, the kernel shuts down. Sometimes you will restart the kernel yourself because something is behaving strangely in your notebook.

Anytime this happens, you will lose all variables, and you will have to recreate them by running the relevant code again. For the purposes of this class, this will mostly be fine. It won't take that long to recreate variables, so you shouldn't worry about having to do this. Generally, it is good practice to structure your notebook such that you can just hit "Run All Cells" when you start it up, and it gets you to wherever you need to be to keep on with the project.

## Operators

Operators are generally symbols that act on variables or pieces of data. The structure is typically `variable operator other_variable`, much like formulas in math. Below are common operators.

### Arithmetic Operators

| Symbol | Task Performed |
|----|---|
| +  | Addition |
| -  | Subtraction |
| /  | Division |
| *  | Multiplication |
| **  | To the power of |

In [8]:
1+2

3

In [9]:
2-1

1

In [10]:
1*2

2

In [11]:
1/2

0.5

Standard math operators work as expected on variables that represent numbers.

In [12]:
a = 2
b = 3

In [13]:
a + b

5

In [14]:
a * b

6

In [15]:
a ** b # a to the power of b (a^b does something completely different!)

8

In [16]:
a / b

0.6666666666666666
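To make the comment about `a^b` concrete: in Python, `^` is the bitwise exclusive-or (XOR) operator on integers, not exponentiation. A brief sketch using the same `a` and `b`:

```python
a = 2
b = 3

print(a ** b)  # exponentiation: 2 * 2 * 2
print(a ^ b)   # bitwise XOR of binary 10 and binary 11, which is binary 01
```

The two lines print different numbers, which is why `**` is the operator to reach for when you mean "to the power of".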

### String Operators

Strings also have operators. You can add them:

In [17]:
print('Hello,' + ' ' + 'World!')

Hello, World!


You can even multiply them:

In [18]:
print("hello"*3)

hellohellohello
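The same operators work on variables that hold strings. A short sketch (the variable names are illustrative): note that `+` does not insert spaces for you, and that joining a string and a number with `+` raises a `TypeError` unless you convert the number with `str()` first.

```python
greeting = 'Hello'
name = 'World'

message = greeting + ', ' + name + '!'
print(message)

underline = '-' * len(message)   # repeat '-' once per character in message
print(underline)

count = 3
label = 'Count: ' + str(count)   # str() turns the number into a string first
print(label)
```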


### Relational Operators

We will use relational operators a lot. They compare two values and return either `True` or `False`, and we can use these truth values to make decisions, such as choosing which rows of a data set to keep.

| Symbol | Task Performed |
|----|---|
| == | Equal to |
| !=  | Not equal to |
| < | Less than |
| > | Greater than |
| <=  | Less than or equal to |
| >=  | Greater than or equal to |

In [19]:
z = 1

Note that `=` and `==` are different things. `=` sets a variable equal to something. `==` tests if two things are equal, but it doesn't set anything equal.

In [20]:
z == 1

True

In [21]:
z == 0

False

In [22]:
z > 1

False

In [23]:
z >= 1

True
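A minimal sketch of the difference between `=` and `==` (the name `is_one` is illustrative): the result of a comparison is itself a value, so it can be stored in a variable, and the comparison leaves `z` untouched.

```python
z = 1                # assignment: z now holds 1

is_one = (z == 1)    # comparison: produces True or False, z is unchanged
print(is_one)
print(z != 1)
print(z <= 0)
print('apple' == 'Apple')   # string comparison is case-sensitive
```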

Boolean operators take boolean values (either `True` or `False`) and return a boolean value. For two variables `a` and `b`, the output of the `and` operator is:

| a | b | a and b |
|----|---|---|
| `True` | `True` | `True` |
| `False`  | `True` | `False` |
| `True` | `False` | `False` |
| `False` | `False` | `False` |

That is, `a and b` is only `True` if both inputs are `True`. For the `or` operator, the result is `True` if at least one of the inputs is `True`:

| a | b | a or b |
|----|---|---|
| `True` | `True` | `True` |
| `False`  | `True` | `True` |
| `True` | `False` | `True` |
| `False` | `False` | `False` |

A simple but important boolean operator is `not`, which takes a single value and flips it:

| a | not a |
|----|---|
| `True` | `False` |
| `False`  | `True` |

In [24]:
a = (1 > 3)
b = (3 == 3)

In [25]:
a

False

In [26]:
b

True

In [27]:
a or b

True

In [28]:
a and b

False

In [29]:
not a

True
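Relational and boolean operators are often combined to express a condition with several parts. The sketch below uses a made-up `temperature` value and made-up thresholds purely for illustration:

```python
temperature = 72   # hypothetical value

# True only when both comparisons are True
is_comfortable = (temperature >= 65) and (temperature <= 80)

# True when at least one comparison is True
is_extreme = (temperature < 32) or (temperature > 100)

print(is_comfortable)
print(is_extreme)
print(not is_extreme)
```

The parentheses are not strictly required here, but they make the grouping easier to read.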

## Functions

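Functions will be very familiar to anyone who has programmed in any language or used Excel extensively, and they work as you would expect: you pass one or more inputs (arguments) inside parentheses, and the function returns a result.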

In [30]:
type(3)

int

In [31]:
len('hello')

5

In [32]:
round(3.3)

3

__TIP:__ To find out what a function does, you can type its name and then a question mark to
get a pop-up help window.
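The question-mark syntax is specific to IPython and Jupyter. In plain Python, the built-in `help()` function prints the same kind of documentation. The sketch below also tries the optional second argument of `round`, `ndigits`; the specific numbers are only illustrative.

```python
help(round)   # prints the signature and docstring of round

whole = round(3.14159)         # ndigits omitted: the result is an integer
two_places = round(3.14159, 2) # keep two decimal places
hundreds = round(1234.5, -2)   # negative ndigits rounds to the left of the decimal point

print(whole, two_places, hundreds)
```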

In [33]:
?round
round(3.14159, 2)

3.14

Signature: round(number, ndigits=None)
Docstring:
Round a number to a given precision in decimal digits.

The return value is an integer if ndigits is omitted or None.  Otherwise
the return value has the same type as the number.  ndigits may be negative.
Type:      builtin_function_or_method

__TIP:__ Many useful functions are not built into Python itself, but are in external
packages. These need to be imported into your Python notebook (or program) before
they can be used.

In [34]:
import numpy as np

These functions can then be accessed by using `np.FUNCTION_NAME`. Some examples of numpy functions and "things":

In [35]:
print(np.sqrt(4))
print(np.pi)  # Not a function, just a variable
print(np.sin(np.pi))

2.0
3.141592653589793
1.2246467991473532e-16


We can also import specific functions so that we don't have to use the prefix `np.` using the `from LIBRARY_NAME import FUNCTION_NAME` syntax.

In [36]:
from numpy import pi

In [37]:
print(pi)

3.141592653589793
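Both import styles work for any module, not only numpy. As a sketch, here they are with `math`, a module that ships with Python (chosen here only so the example runs without installing anything; the alias `m` is arbitrary):

```python
import math             # style 1: import the module, use the math. prefix
from math import sqrt   # style 2: import one name, no prefix needed
import math as m        # an alias, in the same way np is an alias for numpy

print(math.pi)
print(sqrt(16))
print(m.floor(2.7))
```

Importing a single name is convenient, but the prefix makes it obvious where a function comes from, which helps when several packages define functions with the same name.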


## Methods

Before we get any farther into the Python language, we have to say a word about "objects". We
will not be teaching object oriented programming in this class, but you will encounter objects
throughout Python (in fact, even seemingly simple things like numbers and strings are actually
objects in Python).

In the simplest terms, you can think of an object as a small bundled "thing" that contains within
itself both data and functions that operate on that data. For example, strings in Python are
objects that contain a set of characters and also various functions that operate on the set of
characters. When bundled in an object, these functions are called "methods".

Instead of the "normal" `function(arguments)` syntax, methods are called using the
syntax `object.method(arguments)`.

In [38]:
a = 'hello, world'

In [39]:
type(a)

str

Objects have bundled methods. For example:

In [40]:
a.capitalize()

'Hello, world'

In [41]:
a.replace('l', 'X')

'heXXo, worXd'

You can combine operators and methods like so:

In [42]:
((a + " y'all " + a)*3).replace('l', 'X').replace('dh', 'd h')

"heXXo, worXd y'aXX heXXo, worXd heXXo, worXd y'aXX heXXo, worXd heXXo, worXd y'aXX heXXo, worXd"

Let's break it down. We first create the string `"hello, world y'all hello, world"` using `a + " y'all " + a`. Then we repeat that string three times with `(a + " y'all " + a)*3` to give us `"hello, world y'all hello, worldhello, world y'all hello, worldhello, world y'all hello, world"`. Then we replace all `l`s with `X`s. Finally, we replace `dh` with `d h` to put a space where one copy of the string ends and the next begins. While this may look a little complicated, it quickly becomes second nature to parse this kind of syntax.
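The same breakdown can be written out one step per line. This is only a sketch with illustrative names (`step1` through `step4`); it produces the same result as the chained expression.

```python
a = 'hello, world'

step1 = a + " y'all " + a            # join the pieces into one string
step2 = step1 * 3                    # repeat it three times, with no space between copies
step3 = step2.replace('l', 'X')      # each method call returns a new string
step4 = step3.replace('dh', 'd h')   # add a space where two copies meet

chained = ((a + " y'all " + a)*3).replace('l', 'X').replace('dh', 'd h')
print(step4 == chained)
print(a)   # the original string is unchanged
```

Chaining works because each method returns a new string, and the next method is called on that returned value.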

# Working with Data

We will primarily be working with structured sets of data. To do that, we will use a library called `pandas`.

In [43]:
import pandas as pd

The line below reads in a `csv` (or ***c***omma ***s***eparated ***v***alues) file as a pandas _dataframe_ and saves it into the variable named `movies_df`. A CSV file is a text file where each line corresponds to a row of data, and the columns within a row are separated by commas. The `index_col="Title"` argument tells pandas to use the `Title` column as the label for each row.

In [44]:
movies_df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Title")

`movies_df` is a pandas dataframe. We can verify that with the `type()` function we saw earlier.

In [45]:
type(movies_df)

pandas.core.frame.DataFrame
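To see what `read_csv` and `index_col` do without needing the data file, here is a minimal sketch that reads a few lines of CSV text from a string. The name `small_df` and the three-row sample are illustrative; the values are a small subset of the movie data used in this section.

```python
import io
import pandas as pd

# A tiny stand-in for a CSV file: a header line, then one line per row.
csv_text = """Title,Rank,Director,Year
Guardians of the Galaxy,1,James Gunn,2014
Prometheus,2,Ridley Scott,2012
Split,3,M. Night Shyamalan,2016
"""

# io.StringIO lets read_csv treat the string as if it were a file.
small_df = pd.read_csv(io.StringIO(csv_text), index_col="Title")

print(small_df)
print(small_df.index.name)      # the Title column became the row labels
print(list(small_df.columns))   # so Title is no longer an ordinary column
```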

Dataframes come with many useful built-in methods. For example, we can use the `.head()` method to see the first few rows.

In [46]:
movies_df.head()

| Title | Rank | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
|----|---|---|---|---|---|---|---|---|---|---|---|
| Guardians of the Galaxy | 1 | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
| Prometheus | 2 | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
| Split | 3 | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |
| Sing | 4 | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 |
| Suicide Squad | 5 | Action,Adventure,Fantasy | A secret government agency recruits some of th... | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0 |


We can list the columns by accessing the `columns` _attribute_ of the dataframe.

In [47]:
movies_df.columns

Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')

We can then get a specific column using the following syntax.

In [48]:
movies_df['Rank']

Title
Guardians of the Galaxy       1
Prometheus                    2
Split                         3
Sing                          4
Suicide Squad                 5
                           ... 
Secret in Their Eyes        996
Hostel: Part II             997
Step Up 2: The Streets      998
Search Party                999
Nine Lives                 1000
Name: Rank, Length: 1000, dtype: int64

We can get a specific row (or ***loc***ation) by its label using `.loc`. Note that `.loc` is used with square brackets rather than parentheses.

In [49]:
movies_df.loc["Secret in Their Eyes"]

Rank                                                                996
Genre                                               Crime,Drama,Mystery
Description           A tight-knit team of rising investigators, alo...
Director                                                      Billy Ray
Actors                Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...
Year                                                               2015
Runtime (Minutes)                                                   111
Rating                                                              6.2
Votes                                                             27585
Revenue (Millions)                                                  NaN
Metascore                                                          45.0
Name: Secret in Their Eyes, dtype: object
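Column selection and `.loc` can be tried on a much smaller dataframe. The sketch below builds one by hand (the name `small_df` and the choice of columns are illustrative; the values come from rows shown earlier) and also shows that `.loc` accepts a row label followed by a column name to pick out a single cell.

```python
import pandas as pd

# A small illustrative dataframe, indexed by Title like movies_df.
small_df = pd.DataFrame(
    {
        "Rank": [1, 2, 3],
        "Year": [2014, 2012, 2016],
        "Rating": [8.1, 7.0, 7.3],
    },
    index=pd.Index(["Guardians of the Galaxy", "Prometheus", "Split"], name="Title"),
)

ratings = small_df["Rating"]                # one column, returned as a Series
row = small_df.loc["Prometheus"]            # one row, also returned as a Series
year = small_df.loc["Prometheus", "Year"]   # one cell: row label, then column name

print(ratings)
print(row)
print(year)
```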

Dataframes also come with useful methods such as the `.describe()` method that gives quick summaries of the data in the dataframe.

In [50]:
movies_df.describe()

|  | Rank | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
|----|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 872.0 | 936.0 |
| mean | 500.5 | 2012.783 | 113.172 | 6.7232 | 169808.3 | 82.956376 | 58.985043 |
| std | 288.819436 | 3.205962 | 18.810908 | 0.945429 | 188762.6 | 103.25354 | 17.194757 |
| min | 1.0 | 2006.0 | 66.0 | 1.9 | 61.0 | 0.0 | 11.0 |
| 25% | 250.75 | 2010.0 | 100.0 | 6.2 | 36309.0 | 13.27 | 47.0 |
| 50% | 500.5 | 2014.0 | 111.0 | 6.8 | 110799.0 | 47.985 | 59.5 |
| 75% | 750.25 | 2016.0 | 123.0 | 7.4 | 239909.8 | 113.715 | 72.0 |
| max | 1000.0 | 2016.0 | 191.0 | 9.0 | 1791916.0 | 936.63 | 100.0 |


The arithmetic operators that we saw above for numbers also work on entire columns, one row at a time. For example, if we would like to compute the revenue per minute of runtime, we could do the following.

In [51]:
movies_df['Revenue/Runtime'] = movies_df['Revenue (Millions)'] / movies_df['Runtime (Minutes)']

In [52]:
movies_df

| Title | Rank | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore | Revenue/Runtime |
|----|---|---|---|---|---|---|---|---|---|---|---|---|
| Guardians of the Galaxy | 1 | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 | 2.753140 |
| Prometheus | 2 | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 | 1.019839 |
| Split | 3 | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 | 1.180513 |
| Sing | 4 | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 | 2.502963 |
| Suicide Squad | 5 | Action,Adventure,Fantasy | A secret government agency recruits some of th... | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0 | 2.642439 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Secret in Their Eyes | 996 | Crime,Drama,Mystery | A tight-knit team of rising investigators, alo... | Billy Ray | Chiwetel Ejiofor, Nicole Kidman, Julia Roberts... | 2015 | 111 | 6.2 | 27585 | NaN | 45.0 | NaN |
| Hostel: Part II | 997 | Horror | Three American college students studying abroa... | Eli Roth | Lauren German, Heather Matarazzo, Bijou Philli... | 2007 | 94 | 5.5 | 73152 | 17.54 | 46.0 | 0.186596 |
| Step Up 2: The Streets | 998 | Drama,Music,Romance | Romantic sparks occur between two dance studen... | Jon M. Chu | Robert Hoffman, Briana Evigan, Cassie Ventura,... | 2008 | 98 | 6.2 | 70699 | 58.01 | 50.0 | 0.591939 |
| Search Party | 999 | Adventure,Comedy | A pair of friends embark on a mission to reuni... | Scot Armstrong | Adam Pally, T.J. Miller, Thomas Middleditch,Sh... | 2014 | 93 | 5.6 | 4881 | NaN | 22.0 | NaN |


We can then sort our data set by our new column `Revenue/Runtime`.

In [53]:
movies_df.sort_values(by='Revenue/Runtime', ascending=False).head()

| Title | Rank | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore | Revenue/Runtime |
|----|---|---|---|---|---|---|---|---|---|---|---|---|
| Star Wars: Episode VII - The Force Awakens | 51 | Action,Adventure,Fantasy | Three decades after the defeat of the Galactic... | J.J. Abrams | Daisy Ridley, John Boyega, Oscar Isaac, Domhna... | 2015 | 136 | 8.1 | 661608 | 936.63 | 81.0 | 6.886985 |
| Jurassic World | 86 | Action,Adventure,Sci-Fi | A new theme park, built on the original site o... | Colin Trevorrow | Chris Pratt, Bryce Dallas Howard, Ty Simpkins,... | 2015 | 124 | 7.0 | 455169 | 652.18 | 59.0 | 5.259516 |
| Finding Dory | 120 | Animation,Adventure,Comedy | The friendly but forgetful blue tang fish, Dor... | Andrew Stanton | Ellen DeGeneres, Albert Brooks,Ed O'Neill, Kai... | 2016 | 97 | 7.4 | 157026 | 486.29 | 77.0 | 5.013299 |
| Avatar | 88 | Action,Adventure,Fantasy | A paraplegic marine dispatched to the moon Pan... | James Cameron | Sam Worthington, Zoe Saldana, Sigourney Weaver... | 2009 | 162 | 7.8 | 935408 | 760.51 | 83.0 | 4.694506 |
| The Avengers | 77 | Action,Sci-Fi | Earth's mightiest heroes must come together an... | Joss Whedon | Robert Downey Jr., Chris Evans, Scarlett Johan... | 2012 | 143 | 8.1 | 1045588 | 623.28 | 69.0 | 4.358601 |
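The column arithmetic and sorting steps can be reproduced on a handful of rows. This is a sketch, not the full data set: `small_df` and `ranked` are illustrative names, and the four rows use revenue and runtime values shown in the tables of this section, including one movie whose revenue is missing.

```python
import pandas as pd

# None marks the missing revenue; pandas stores it as NaN.
small_df = pd.DataFrame(
    {
        "Revenue (Millions)": [126.46, 333.13, None, 138.12],
        "Runtime (Minutes)": [124, 121, 111, 117],
    },
    index=pd.Index(
        ["Prometheus", "Guardians of the Galaxy", "Secret in Their Eyes", "Split"],
        name="Title",
    ),
)

# The division is applied row by row; a missing revenue gives a missing result.
small_df["Revenue/Runtime"] = (
    small_df["Revenue (Millions)"] / small_df["Runtime (Minutes)"]
)

# sort_values returns a new, sorted dataframe and leaves small_df as it was.
# By default, rows with missing values (NaN) are placed at the end.
ranked = small_df.sort_values(by="Revenue/Runtime", ascending=False)
print(ranked)
```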


We can also use the relational operators to pull out specific rows. Suppose that we would like to see all movies directed by James Gunn.

In [54]:
movies_df['Director'] == 'James Gunn'

Title
Guardians of the Galaxy     True
Prometheus                 False
Split                      False
Sing                       False
Suicide Squad              False
                           ...  
Secret in Their Eyes       False
Hostel: Part II            False
Step Up 2: The Streets     False
Search Party               False
Nine Lives                 False
Name: Director, Length: 1000, dtype: bool

The above tells us for every movie in our database whether or not it was made by James Gunn. This isn't the best way to view the data, so we can instead use the result to pull out only the rows for which the director was James Gunn.

In [55]:
movies_df[movies_df['Director'] == 'James Gunn']

| Title | Rank | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore | Revenue/Runtime |
|----|---|---|---|---|---|---|---|---|---|---|---|---|
| Guardians of the Galaxy | 1 | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 | 2.75314 |
| Slither | 909 | Comedy,Horror,Sci-Fi | A small town is taken over by an alien plague,... | James Gunn | Nathan Fillion, Elizabeth Banks, Michael Rooke... | 2006 | 95 | 6.5 | 64351 | 7.77 | 69.0 | 0.081789 |
| Super | 938 | Comedy,Drama | After his wife falls under the influence of a ... | James Gunn | Rainn Wilson, Ellen Page, Liv Tyler, Kevin Bacon | 2010 | 96 | 6.8 | 64535 | 0.32 | 50.0 | 0.003333 |
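The filtering step works in two parts: a comparison produces a column of `True`/`False` values (often called a mask), and indexing the dataframe with that mask keeps only the `True` rows. The sketch below shows this on a few rows taken from this section, and also combines two conditions. The names `small_df`, `is_gunn`, and `is_recent`, and the cutoff year, are illustrative.

```python
import pandas as pd

small_df = pd.DataFrame(
    {
        "Director": ["James Gunn", "Ridley Scott", "James Gunn", "James Gunn"],
        "Year": [2014, 2012, 2006, 2010],
        "Rating": [8.1, 7.0, 6.5, 6.8],
    },
    index=pd.Index(
        ["Guardians of the Galaxy", "Prometheus", "Slither", "Super"], name="Title"
    ),
)

is_gunn = small_df["Director"] == "James Gunn"   # a Series of True/False values
is_recent = small_df["Year"] >= 2010

# For whole columns, combine masks with & (and), | (or), ~ (not)
# instead of the keywords and, or, not.
recent_gunn = small_df[is_gunn & is_recent]
print(recent_gunn)
```

The keywords `and`, `or`, and `not` expect a single truth value, so they raise an error when given a whole column; the symbol forms operate row by row.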


There are lots of other things that can be done with dataframes. The syntax can be a little difficult at first, but as we work with dataframes, and other components of the standard Python data science toolkit, this syntax will become familiar.