# Intro to Python Programming Using Pandas

Hello, and welcome to the Intro to Python Programming tutorial notebook! This notebook is adapted from an excellent tutorial called *Data Analysis Example: Analyzing Movie Ratings with Python* by Cagdas Yetkin. You can check out the original tutorial [here](https://codingnomads.co/blog/data-analysis-example-analyzing-movie-ratings-with-python/), and the corresponding GitHub code [here](https://github.com/CodingNomads/movie-analysis-python-pandas/blob/master/movie-analysis.ipynb).

The goal of this notebook is to get you quickly up to speed with some of the basics of Python programming (and programming in general!) and data manipulation. Instead of learning Python in a series of isolated and generic examples, we are going to try learning it with a contextualized example that will have immediate applications to scRNA-seq analysis.

How exactly does a tutorial based on movie ratings connect to scRNA-seq analysis? It all comes down to `pandas`, a popular Python library used for data analysis and manipulation. When you do scRNA-seq analysis using `scanpy`, you are using a `pandas` dataframe to store, manipulate, and analyze your data. So, if we can start learning how to use `pandas` now, some aspects of scRNA-seq analysis will make a lot more sense once we get to them!

# 1 Get and Inspect the Data

First, before we analyze any data, we have to have data to analyze. We are going to be using data from the [Movie Tweetings Project](https://github.com/sidooms/MovieTweetings/tree/master/latest). This data consists of movie ratings from Twitter since 2013, updated daily. The data was created from people who connected their IMDB profile with their Twitter accounts. Whenever they rated a movie on the IMDB website, an automated process generated a standard, well-structured tweet. We can use this data to learn and practice data analysis using Python.

Let's make a new directory in our */workspace/intro_to_programming/* folder called *data*. There are two ways you can do this. You can open a terminal, navigate into */workspace/intro_to_programming/* using the `cd` command, and then make a data directory by using the `mkdir` command (i.e., `mkdir data`). Or, you can click on the new folder icon in the left sidebar and change the label to *data*.

Once we have created our *data* directory, we need to download our data files. There are three:
* *movies.dat* (`https://github.com/sidooms/MovieTweetings/raw/master/latest/movies.dat`)
* *ratings.dat* (`https://github.com/sidooms/MovieTweetings/raw/master/latest/ratings.dat`)
* *users.dat* (`https://github.com/sidooms/MovieTweetings/raw/master/latest/users.dat`)

Let's download these using our terminal. Open a terminal, and navigate into the *data* directory. Now, use the `wget` command to download each of the three links (i.e., copy the *movies.dat* link, and run `wget <LINK>` in the terminal. Repeat for the *ratings.dat* and *users.dat* links). After you have downloaded all three files, if you type `ls -l` in the terminal, you should see those three files now listed in your data directory.

Congratulations! We have data! Let's take a peek and see what it looks like.

In your *data* directory, using the terminal type:
```
head -n3 users.dat
```

This returns the first three lines of the `users.dat` file. You should see something that looks like this:
```
1::139564917
2::17528189
3::522540374
```

Without any other information, we would have no way of knowing what exactly we are looking at. Luckily, we can consult the Movie Tweetings Project [README](https://github.com/sidooms/MovieTweetings/blob/master/README.md) file, which tells us that "in *users.dat* the first field is the *user_id* and the second one is *twitter_id*". So we have two data fields, which are separated by `::`. Data fields can be divided by all sorts of different separators, and it's good to know which one is used in the data you are working with. Most often, you will probably see commas or tabs used as separators.

**Exercise_1.1**</br>
Next, let's take a peek at the first 5 lines of the `movies.dat` file. You should see something like this:
```
0000008::Edison Kinetoscopic Record of a Sneeze (1894)::Documentary|Short
0000010::La sortie des usines Lumière (1895)::Documentary|Short
0000012::The Arrival of a Train (1896)::Documentary|Short
25::The Oxford and Cambridge University Boat Race (1895)::
0000091::Le manoir du diable (1896)::Short|Horror
```

**Exercise_1.2**</br>
Knowing that our fields are separated by `::`, how many fields do we have in our `movies.dat` file? What are the fields? (Hint: check out the README!)

**Exercise_1.3**</br>
Take a peek at the *last* 6 lines of the *ratings.dat* file. (Hint: instead of `head`, use `tail`!) You should see something like this:
```
71257::9784456::6::1595810413
71257::9893250::10::1613857551
71257::9898858::3::1585958452
71258::0172495::10::1587107015
71258::0414387::10::1587107852
71259::1623205::6::1362832655
```

How many fields are in *ratings.dat*? What are the fields?

Alright, we have some data and we now have a feel for what the data looks like. Up to this point, we have not used any Python yet. We have been using the same commands that you would use to navigate the file system on your computer. But don't worry, Python is coming up next!

# 2 Set Up Your Notebook

In order to analyze our data, we need to set up our coding environment (this Jupyter notebook!). This involves loading (or importing) all of the Python modules, packages, and libraries we need via the [`import`](https://docs.python.org/3/reference/import.html) statement. A Python [`module`](https://docs.python.org/3/tutorial/modules.html) is a file that can define functions, classes, and variables, and can also include runnable code. A `script` is an executable module (also sometimes called a `program` or `application`). A Python package is a collection of modules under a common namespace. You can think of a package as a file system directory and modules as the files in that directory, though this is an oversimplification. A `library` is a generic term for a collection of code designed to be usable by many applications; it provides generic functionality that specific applications can build on.
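As a rough illustration of the module/package distinction, the sketch below uses the standard library's `json` package (chosen here only as a convenient example; it is not used elsewhere in this notebook). A package carries a `__path__` attribute pointing at its directory, while a plain module does not.

```python
import json          # a package: a directory containing several modules
import json.decoder  # a module inside that package: a single file

# Packages have a __path__ attribute (the directory they live in); plain modules do not.
json_is_package = hasattr(json, "__path__")
decoder_is_package = hasattr(json.decoder, "__path__")

print(json_is_package)     # True
print(decoder_is_package)  # False
```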

Using `import...as...` allows us to rename modules/packages/libraries as they are imported for more concise method calls downstream (for instance, if we said `import pandas` and wanted to use the `.read_csv()` pandas method, we would have to call this as `pandas.read_csv()`. If we said `import pandas as pd` we could instead call `pd.read_csv()`).
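To make the aliasing concrete, here is a minimal sketch showing that `pandas` and `pd` are simply two names for the same library, so either prefix reaches the same function.

```python
import pandas
import pandas as pd

# Both names point at the same library object; `pd` is just a shorter alias.
same_library = pandas is pd
same_function = pandas.read_csv is pd.read_csv

print(same_library, same_function)  # True True
```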

In [1]:
# These imports are necessary
import warnings # this is so we can ignore annoying (and usually unimportant) warning messages

import pandas as pd # our main data analysis and manipulation library
import numpy as np # library to handle large, multidimensional arrays and matrices
import scipy as sc # library for scientific and technical computing

import matplotlib.pyplot as plt # plotting library
import seaborn as sns # another plotting library; built on top of matplotlib

# These adjustments are not necessary, but will make your analysis easier and better-looking
plt.style.use('fivethirtyeight') # make our plots stylized like those on fivethirtyeight.com
pd.set_option('display.max_rows', 50) # display 50 max rows to make our DataFrame more readable/visible
pd.set_option('display.max_columns', 50) # display 50 max columns to make our DataFrame more readable/visible
warnings.filterwarnings('ignore') # have cleaner notebook without warning messages

# 3 Read in the Data

Now that our coding environment is set up, we can read our files into pandas dataframes. In order to do this, we will use the `read_csv` function in `pandas`. This function takes a few parameters that we should pay attention to. We need to specify that the separator (or delimiter) is a double colon `::`; provide the column names, so that they will become headers in our dataframes; and convert the UNIX time in the *ratings.dat* file to a more readable datetime format. Let's read in our files one by one, starting with *users.dat*.

(***Did you know!*** How do you know what parameters a function might need? All Python libraries have an Application Programming Interface (API) (or at least the good ones do!), which is a set of definitions and protocols that tells you how to use the various functions included in a library. For example, [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) is the `.read_csv()` page in the `pandas` API. You can see that it can take in way more parameters than we will provide. A *required* parameter is something like the *filepath_or_buffer* parameter. You have to pass `.read_csv()` some sort of file to read. An *optional* parameter is something like the *sep* parameter. You don't have to pass `.read_csv()` a delimiter, and if you don't, it will default to a comma. In our case, we will pass `.read_csv()` a *sep* parameter.)
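The difference between relying on a default and overriding an optional parameter can be sketched with two tiny in-memory "files". The values below are made up for illustration, and `io.StringIO` stands in for a real file on disk.

```python
import io
import pandas as pd

# Two tiny in-memory "files" (made-up values, for illustration only)
comma_text = "1,100\n2,200\n"
colon_text = "1::100\n2::200\n"

# Only the required argument is given, so `sep` falls back to its default (a comma).
# header=None tells pandas that the first line is data, not column names.
comma_df = pd.read_csv(io.StringIO(comma_text), header=None)

# Here we override the optional `sep` parameter. A multi-character separator
# is handled by the Python parsing engine, so we request it explicitly.
colon_df = pd.read_csv(io.StringIO(colon_text), sep="::", header=None, engine="python")

print(comma_df.shape)  # (2, 2)
print(colon_df.shape)  # (2, 2)
```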

In [2]:
# a multi-character separator such as '::' needs the Python parsing engine
users = pd.read_csv('data/users.dat', sep='::', engine='python',
                    names=['user_id', 'twitter_id'])

This creates a `DataFrame` object. We can look at the first few entries of our new *users* `DataFrame` using the `.head()` method. (Note that this is different from the *head* command used in the terminal, which is a Linux command.)

In [3]:
users.head()

| | user_id | twitter_id |
|---|---|---|
| 0 | 1 | 139564917 |
| 1 | 2 | 17528189 |
| 2 | 3 | 522540374 |
| 3 | 4 | 475571186 |
| 4 | 5 | 215022153 |
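By default `.head()` returns the first five rows, and you can pass a number to ask for more or fewer. A minimal sketch on a small made-up dataframe (the `demo` name and its values are hypothetical):

```python
import pandas as pd

# A small made-up dataframe with eight rows
demo = pd.DataFrame({"user_id": range(1, 9), "score": range(10, 90, 10)})

first_five = demo.head()   # default: the first 5 rows
first_two = demo.head(2)   # ask for a specific number of rows

print(len(first_five), len(first_two))  # 5 2
```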


Sweet, that looks pretty good! Let's do the same thing with our other two files.

We'll do the *ratings.dat* file next. Similar to before, you will want to read in the data and save it into a dataframe, define the separator, and pass in the column names. Additionally, you will call the `.sort_values()` method on the dataframe right away, to sort your data by when the ratings were created.

In [4]:
ratings = pd.read_csv('data/ratings.dat', sep='::', engine='python',
                      names=['user_id', 'movie_id', 'rating', 'rating_timestamp']
                      ).sort_values("rating_timestamp") # sorting the dataframe by timestamp
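One thing worth knowing about `.sort_values()`: it reorders the rows but each row keeps its original index label, which is why a sorted dataframe can show seemingly arbitrary numbers in its left-most column. A minimal sketch with made-up values:

```python
import pandas as pd

# Made-up ratings with timestamps deliberately out of order
demo = pd.DataFrame({
    "rating": [6, 8, 7],
    "rating_timestamp": [300, 100, 200],
})

sorted_demo = demo.sort_values("rating_timestamp")

# Rows are reordered, but each row keeps its original index label
print(sorted_demo.index.tolist())      # [1, 2, 0]
print(sorted_demo["rating"].tolist())  # [8, 7, 6]

# reset_index(drop=True) renumbers the rows 0, 1, 2, ... if you prefer
renumbered = sorted_demo.reset_index(drop=True)
print(renumbered.index.tolist())       # [0, 1, 2]
```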

You will also want to convert the `rating_timestamp` values (UNIX time, i.e., seconds since January 1, 1970) to an actual datetime format, and you can do that in `pandas` like so:

In [5]:
ratings["rating_timestamp"] = pd.to_datetime(ratings["rating_timestamp"], unit='s')
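To see what `unit='s'` does, here is a small sketch that converts a few UNIX timestamps by hand. The first two values are chosen for illustration; the third is taken from the *ratings.dat* excerpt shown earlier.

```python
import pandas as pd

# UNIX timestamps count seconds since 1970-01-01 00:00:00 UTC.
# 86400 seconds is exactly one day.
raw = pd.Series([0, 86400, 1362832655])

# unit="s" tells pandas the integers are seconds (not milliseconds or nanoseconds)
converted = pd.to_datetime(raw, unit="s")

print(converted[0])  # 1970-01-01 00:00:00
print(converted[1])  # 1970-01-02 00:00:00
print(converted[2])  # 2013-03-09 12:37:35
```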

Let’s peek into the first 5 rows of your newly created ratings dataframe:

In [6]:
ratings.head()

| | user_id | movie_id | rating | rating_timestamp |
|---|---|---|---|---|
| 146299 | 11570 | 2171847 | 6 | 2013-02-28 14:38:27 |
| 609296 | 47876 | 444778 | 8 | 2013-02-28 14:43:44 |
| 636479 | 49892 | 1411238 | 6 | 2013-02-28 14:47:18 |
| 674391 | 52623 | 1496422 | 7 | 2013-02-28 14:58:23 |
| 774473 | 60785 | 118799 | 5 | 2013-02-28 15:00:53 |


Two files down, one to go!

**Exercise_3.1**</br>
Read in the *movies.dat* file. What do the first 5 lines look like? (Hint: use `.read_csv()` to read in the data and save it into a dataframe, define the separator, and pass in the column names!)

In [9]:
# DELETE FOR EXERCISES
movies = pd.read_csv('data/movies.dat', sep='::', engine='python',
                     names=['movie_id', 'movie_title', 'genres'])
movies.head()

| | movie_id | movie_title | genres |
|---|---|---|---|
| 0 | 8 | Edison Kinetoscopic Record of a Sneeze (1894) | Documentary\|Short |
| 1 | 10 | La sortie des usines Lumière (1895) | Documentary\|Short |
| 2 | 12 | The Arrival of a Train (1896) | Documentary\|Short |
| 3 | 25 | The Oxford and Cambridge University Boat Race ... | |
| 4 | 91 | Le manoir du diable (1896) | Short\|Horror |


Fantastic! All of our data is now read into our notebook, and we have some dataframes to play with. Next we will explore our data.

# 4 Exploration