# All material ©2019, Alex Siegman

---

# An Introduction to Python and Pandas 

## A few basics regarding Jupyter Notebooks (formerly IPython Notebooks)

<br>

1. To execute a cell, hit Shift+Enter <br><br>

2. If you want to execute a cell and add an additional, blank cell below it, hit Option+Enter (Alt+Enter on Windows) <br><br>

3. At the top of the window you will see that this cell says 'Markdown' - this means you can type as you would in any word processor. To execute any Python code, you must ensure the cell is set to 'Code' (you can ignore the other two options for now). 

    NB: You can use this cheat sheet in order to do things in Markdown like make headers or use bold/italics: https://guides.github.com/pdfs/markdown-cheatsheet-online.pdf <br><br>

4. If you make a mistake and, for instance, start a 'while' loop with no ending and your computer gets stuck, you can click the square next to the 'Code/Markdown' dropdown menu above, which will interrupt the kernel (the process that runs your code) and stop the cell. <br><br>

5. When a cell is running, there will be a little '*' to its left. This means it's working on executing the code in the cell (the more complex the code, the longer it will take to execute). Once it is done, it will display a number. The numbers are really just there to show you the order in which you executed your cells, and nothing more. They can largely be ignored.

<br>

## So, what is Python, anyway? 

<br>

Python is one of many programming languages, and just like other languages, it has its pros and cons. 

For more on Python, the official site has a handy getting-started page: https://www.python.org/about/gettingstarted/ . For now, just know that it is what's called an object-oriented, high-level programming language (read as: versatile and fairly easy to read). 

<br>

Python was released more than 25 years ago, in 1991, by Guido van Rossum. In his own words:

_"...In December 1989, I was looking for a "hobby" programming project that would keep me occupied during the week around Christmas. My office ... would be closed, but I had a home computer, and not much else on my hands. I decided to write an interpreter for the new scripting language I had been thinking about lately: a descendant of ABC that would appeal to Unix/C hackers. I chose Python as a working title for the project, being in a slightly irreverent mood (and a big fan of Monty Python's Flying Circus)."_

<br>


Finally, and perhaps most importantly, if you ever have any questions, you can ask me or visit https://stackoverflow.com/, quite possibly the most useful tool on the internet. Consider it the Google of coding questions. Input your search query (e.g., 'Convert string to integer') and you'll get hundreds if not thousands of answers!

<br>

P.S. You are most likely running a version of Python 3, such as 3.6.4 (to check which version you are running, open your terminal and simply type "python --version"). Unfortunately, the move from Python 2 to Python 3 came with some quirky changes. For instance, in Python 2, to print something you would say: 

    print "Hello, my name is Alex" 

Whereas in Python 3, you say: 

    print("Hello, my name is Alex") 

It may seem trivial, but it's anything but when you can't figure out why the code you've spent all night writing won't execute. 
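
If you'd rather check from inside the notebook than from the terminal, here is a minimal sketch using the standard library's `sys` module. The variable name `running_python_3` is just illustrative.

```python
import sys

# sys.version_info holds the interpreter version as (major, minor, micro, ...)
print(sys.version_info.major, sys.version_info.minor)

# True on any Python 3 interpreter, where print() is a function
running_python_3 = sys.version_info.major >= 3
print(running_python_3)
```

If this prints `False`, you are on Python 2 and the `print(...)` examples in this notebook may behave differently.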

P.P.S. Perhaps most important of all, if you are in a 'Code' cell and want to type something non-code, just put a '#' before it (demonstrated below). 
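
As a quick sketch of what that looks like in a 'Code' cell (the variable name `greeting` is made up for this example):

```python
# This whole line is a comment, so Python ignores it.
greeting = "Hello, my name is Alex"  # a comment can also follow code on the same line
print(greeting)
```

Only the text after the '#' is ignored; everything before it on the same line still runs.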

<br>

## What is Pandas? 

Pandas (https://pandas.pydata.org/) is an open source library that allows you to easily work with and analyze structured data in Python. 
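
To make "structured data" a little more concrete, here is a minimal sketch of a Pandas table (a `DataFrame`) built by hand. The values and the name `users` are made up for illustration and are not part of the course dataset.

```python
import pandas as pd

# A tiny, made-up table: each dictionary key becomes a column
users = pd.DataFrame({
    "age": [25, 41, 33],
    "clicked_on_ad": ["Yes", "No", "Yes"],
})

print(users)
print(users.shape)  # (number of rows, number of columns)
```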

## Why is Pandas Useful? 

Let's think about Stern Technologies, the fictitious organization around which this entire course is based. Remember, they are an AdTech organization {tktktk}. 

Imagine you have just been hired at Stern Technologies, and you have been asked to generate a basic report on Stern Tech users.  

In [44]:
import pandas as pd # importing the Pandas library

#####  For a full list of all the possible Pandas operations:  https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

### In order to load our CSV (Comma-Separated Values) file into our Jupyter Notebook, we need to point our machine to the right folder, so to speak. We can use Command Line commands to do so. For more on the Command Line, check out "Unix 101" in the GitHub repo. 

##### Because Stern Technologies isn't real, we're using a dataset that I (Alex) manufactured. Remember, NONE OF THESE CUSTOMERS ARE REAL. 

In [45]:
!pwd # AKA 'Print Working Directory': tells us what folder we are in right now.

# think of this like using your mouse to click into and out of folders on your desktop. 
# this is just bypassing that UI.

# the '!' allows you to execute a shell command. Basically, you're working as if you would in 
# your terminal, but from the Jupyter Notebook.

/Users/siegmanA/Desktop/Projects-in-Programming-Fall-2019/Python and Pandas (Class 2)


In [46]:
ls # list all of the files in the current directory (remember, directory = folder in UI world)

Python and Pandas.ipynb  SternTech_UserData.csv
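
The same two checks can also be done in plain Python, without shell commands. This is a minimal sketch using the standard library's `pathlib`; the names `current_folder` and `file_names` are illustrative.

```python
from pathlib import Path

# Equivalent of `pwd`: the folder the notebook is currently running in
current_folder = Path.cwd()
print(current_folder)

# Equivalent of `ls`: the names of everything in that folder, sorted
file_names = sorted(p.name for p in current_folder.iterdir())
print(file_names)
```

What gets printed depends entirely on where your own notebook lives, so your output will differ from mine.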


### Now that I'm in the right place, I can 'read' my CSV using the following command:

In [47]:
df = pd.read_csv('./SternTech_UserData.csv',encoding='utf-8') # read in the csv

# we are assigning our dataset to a variable named 'df'.
# we can name this variable anything at all, it doesn't matter.
# df is commonplace, though, and stands for 'data frame'.

# you can ignore the 'encoding' piece for now, we'll get to that later on when we talk about web scraping. 
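
If you want to try `read_csv` without the course file on hand, here is a minimal sketch that reads CSV text straight from a string. The three columns and two rows are made up for illustration (they are not the real dataset), and `toy_df` is a hypothetical name.

```python
import io
import pandas as pd

# Made-up stand-in for the contents of a CSV file (not the real dataset)
csv_text = """id,age,ad_type
a1,34,Travel
b2,58,Tech
"""

# io.StringIO lets read_csv treat the string as if it were a file on disk
toy_df = pd.read_csv(io.StringIO(csv_text))
print(toy_df)
```

The first line of the text becomes the column names, and each later line becomes one row.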

### Let's begin with a primary, exploratory analysis of our data...

In [48]:
pd.options.display.max_rows = 2000 # the way Jupyter Notebook tends to display the results of such queries isn't 
                                   # always helpful, but we can very easily change that.
                                   # this will ensure we can view up to 2,000 rows without seeing ellipses in the UI
    
pd.options.display.max_columns = 50 # try commenting out this last line ('max_columns =50') then run the cell below
                                    # to see the difference this formatting makes 
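
If you only want to change a display setting temporarily, one option is `pd.option_context`, sketched below. The value 5 is arbitrary and the variable names are illustrative.

```python
import pandas as pd

# Inside the `with` block, at most 5 rows are shown before pandas truncates
with pd.option_context("display.max_rows", 5):
    rows_inside = pd.get_option("display.max_rows")

# Outside the block, the previous setting is restored automatically
rows_outside = pd.get_option("display.max_rows")
print(rows_inside, rows_outside)
```

This is handy when one cell needs a different view but you don't want to change the whole notebook.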

In [49]:
df.head() # this gets the first five rows of data in your data frame 
          # df.tail() will give you the last five rows
          # if you want, you can choose any number - df.head(15) would give you the first 15 rows, for instance

| | Unnamed: 0 | id | company_size | age | sex | clicked_on_ad | ad_type | location | timestamp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 081217b4-1cf5-4657-8287-6db1b75462e4 | large | 92 | M | Yes | Business | MidWest | 2018-08-26 06:00:27.124290 |
| 1 | 1 | d0b45a01-b73d-4f8e-bfa8-c53ea75397f1 | large | 56 | M | Yes | Culinary | SouthWest | 2011-06-01 18:54:34.815634 |
| 2 | 2 | 1dc2e636-e19b-4d42-b228-df09cd009acb | large | 20 | F | No | Business | SouthEast | 2013-07-16 00:24:47.888180 |
| 3 | 3 | 5d09d6d4-023e-4fa1-9559-89526679e885 | large | 55 | F | Yes | Political | NorthWest | 2010-06-25 12:13:51.369878 |
| 4 | 4 | b69e54e3-fc89-4c0f-8bdb-280409db173e | medium | 25 | N | No | Tech | US | 2010-09-22 07:53:12.454909 |


In [50]:
list(df) # get a list of all the column names for your data frame

# we'll discuss this later on, but note that a list is made up of comma-separated values inside of square brackets

['Unnamed: 0',
 'id',
 'company_size',
 'age',
 'sex',
 'clicked_on_ad',
 'ad_type',
 'location',
 'timestamp']
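
As a small sketch of what you can do with that list once you have it (the two-column `toy_df` here is made up for illustration):

```python
import pandas as pd

toy_df = pd.DataFrame({"age": [25, 41], "sex": ["F", "M"]})

column_names = list(toy_df)   # the same names as toy_df.columns.tolist()
print(column_names)
print(column_names[0])        # lists are indexed starting from zero
print(len(column_names))      # how many columns there are
```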

In [51]:
df.count() # get a count of the non-NA cells for each column

Unnamed: 0       50000
id               50000
company_size     50000
age              50000
sex              50000
clicked_on_ad    50000
ad_type          50000
location         50000
timestamp        50000
dtype: int64

### You'll note that every column reports 50,000 non-NA values, so nothing is missing here. Real-world data usually has a few missing values (which is totally normal); we'll deal with those later.
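
To see what missing values look like in practice, here is a minimal sketch on a made-up table with two holes in it. The data and the names `toy_df` and `missing_per_column` are assumptions for illustration only.

```python
import pandas as pd

# None marks a missing value; pandas treats it as NA
toy_df = pd.DataFrame({
    "age": [25, None, 33],
    "location": ["US", "Canada", None],
})

print(toy_df.count())                     # non-NA cells per column
missing_per_column = toy_df.isna().sum()  # NA cells per column
print(missing_per_column)
```

`count()` and `isna().sum()` are two views of the same thing: for each column, the two numbers add up to the total number of rows.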

In [52]:
df.info() # just some basic information on the data types (strings, integers, floats, et cetera) for each column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 9 columns):
Unnamed: 0       50000 non-null int64
id               50000 non-null object
company_size     50000 non-null object
age              50000 non-null int64
sex              50000 non-null object
clicked_on_ad    50000 non-null object
ad_type          50000 non-null object
location         50000 non-null object
timestamp        50000 non-null object
dtypes: int64(2), object(7)
memory usage: 3.4+ MB


### It's important to note that our timestamp values are being stored as generic 'object' values (in this case, strings) and not as timestamps, as we'd like. So, let's change that: 

In [55]:
df['timestamp'] = pd.to_datetime(df['timestamp'])

In [56]:
df.head()

| | Unnamed: 0 | id | company_size | age | sex | clicked_on_ad | ad_type | location | timestamp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 081217b4-1cf5-4657-8287-6db1b75462e4 | large | 92 | M | Yes | Business | MidWest | 2018-08-26 06:00:27.124290 |
| 1 | 1 | d0b45a01-b73d-4f8e-bfa8-c53ea75397f1 | large | 56 | M | Yes | Culinary | SouthWest | 2011-06-01 18:54:34.815634 |
| 2 | 2 | 1dc2e636-e19b-4d42-b228-df09cd009acb | large | 20 | F | No | Business | SouthEast | 2013-07-16 00:24:47.888180 |
| 3 | 3 | 5d09d6d4-023e-4fa1-9559-89526679e885 | large | 55 | F | Yes | Political | NorthWest | 2010-06-25 12:13:51.369878 |
| 4 | 4 | b69e54e3-fc89-4c0f-8bdb-280409db173e | medium | 25 | N | No | Tech | US | 2010-09-22 07:53:12.454909 |
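
The table looks the same after the conversion, so here is a minimal sketch of what actually changed. It uses two timestamps copied from the rows above; the names `toy_df` and `years` are illustrative.

```python
import pandas as pd

toy_df = pd.DataFrame({"timestamp": ["2018-08-26 06:00:27.124290",
                                     "2011-06-01 18:54:34.815634"]})

# Convert the text into real timestamps
toy_df["timestamp"] = pd.to_datetime(toy_df["timestamp"])
print(toy_df["timestamp"].dtype)

# Now that the column is a datetime, the .dt accessor can pull out pieces of it
years = toy_df["timestamp"].dt.year
print(years.tolist())
```

Once a column is stored as a timestamp, you can do date arithmetic, sort chronologically, and extract parts like the year or month, none of which work reliably on plain strings.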


### A bit more primary exploratory analysis:

In [57]:
df.describe()

| | Unnamed: 0 | age |
|---|---|---|
| count | 50000.0 | 50000.0 |
| mean | 24999.5 | 58.4073 |
| std | 14433.901067 | 23.679151 |
| min | 0.0 | 18.0 |
| 25% | 12499.75 | 38.0 |
| 50% | 24999.5 | 58.0 |
| 75% | 37499.25 | 79.0 |
| max | 49999.0 | 99.0 |


In [58]:
df.sample() # get one random row from the data frame

| | Unnamed: 0 | id | company_size | age | sex | clicked_on_ad | ad_type | location | timestamp |
|---|---|---|---|---|---|---|---|---|---|
| 18528 | 18528 | 079ec3c3-c2dc-43cb-ac83-d65c78a62656 | medium | 46 | M | No | Political | SouthAmerica | 2015-05-16 12:27:32.179814 |


In [59]:
df['age'].mean() # get the mean of a column

58.4073
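
The mean is only one of several quick summaries you can pull from a single column. Here is a minimal sketch on a made-up four-row table (the data and names are assumptions for illustration):

```python
import pandas as pd

toy_df = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "ad_type": ["Travel", "Tech", "Travel", "Business"],
})

average_age = toy_df["age"].mean()      # (20 + 30 + 40 + 50) / 4
median_age = toy_df["age"].median()     # midpoint of the sorted values
ad_counts = toy_df["ad_type"].value_counts()  # how often each ad type appears

print(average_age, median_age)
print(ad_counts)
```

`value_counts()` is especially useful for text columns like `ad_type`, where a mean wouldn't make sense.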

In [60]:
df.sort_values(by="ad_type",ascending=False) # sort by ad type in reverse alphabetical order (Z to A)

| | Unnamed: 0 | id | company_size | age | sex | clicked_on_ad | ad_type | location | timestamp |
|---|---|---|---|---|---|---|---|---|---|
| 7815 | 7815 | fcc75b8d-a9d1-4df6-8349-c90b7a695ba5 | small | 29 | N | Yes | Travel | US | 2008-03-11 06:49:14.648792 |
| 27404 | 27404 | 4d7313e6-2800-4d34-9f67-2f71c49d2f51 | small | 69 | N | Yes | Travel | SouthEast | 2001-04-03 21:02:54.202994 |
| 27387 | 27387 | 4bfb0944-098a-4521-a1fa-254f50e64a5f | small | 50 | M | No | Travel | NorthEast | 2018-03-03 09:36:25.584646 |
| 36350 | 36350 | 113854fd-9f5c-4dfe-a9ef-4fd433d9551a | startup | 53 | N | No | Travel | MidWest | 2002-06-02 16:37:03.319019 |
| 20321 | 20321 | 053024ed-c1c6-4557-98d9-3933425299ba | startup | 99 | N | Yes | Travel | Mexico | 2013-09-18 09:05:43.365742 |
| 5305 | 5305 | b761725f-bbce-42fd-bdaa-86608cfdb72f | startup | 84 | M | Yes | Travel | NorthWest | 2006-07-19 22:06:38.003395 |
| 27392 | 27392 | 3f517f60-7beb-4d2e-a3b6-d5981044c58a | startup | 39 | M | Yes | Travel | US | 2002-12-11 21:40:36.307376 |
| 47284 | 47284 | cf9d717b-3f33-4157-a783-7c2803e05e0c | medium | 85 | F | Yes | Travel | MidWest | 2015-04-30 01:47:37.764275 |
| 27395 | 27395 | 3c4f74f4-d86c-4509-ad22-62d19c17c75b | medium | 57 | N | No | Travel | NorthWest | 2018-10-28 17:15:52.796209 |
| 27397 | 27397 | eeb7f0fe-3204-4660-a3ec-e423a4fcb3da | large | 54 | M | Yes | Travel | SouthAmerica | 2014-02-25 23:38:46.273337 |
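
You can also sort by more than one column at a time. This is a minimal sketch on a made-up table; the data and the names `toy_df` and `sorted_df` are assumptions for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "ad_type": ["Tech", "Travel", "Tech", "Business"],
    "age": [30, 25, 45, 60],
})

# ad_type Z-to-A first; within the same ad_type, oldest first
sorted_df = toy_df.sort_values(by=["ad_type", "age"], ascending=[False, False])
print(sorted_df)
```

Note that `sort_values` returns a new, sorted data frame and leaves the original untouched, so assign the result to a variable if you want to keep it.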


In [62]:
df[df['age'] > 90] # see any rows where age > 90

| | Unnamed: 0 | id | company_size | age | sex | clicked_on_ad | ad_type | location | timestamp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 081217b4-1cf5-4657-8287-6db1b75462e4 | large | 92 | M | Yes | Business | MidWest | 2018-08-26 06:00:27.124290 |
| 16 | 16 | 346471ed-a4e7-46b3-9418-7dff84b50132 | large | 95 | F | Yes | Culinary | NorthEast | 2006-09-30 02:14:29.869183 |
| 37 | 37 | d6615d05-94d7-49ab-9c5f-707338a213b4 | startup | 98 | F | No | Political | US | 2019-04-30 18:09:53.163624 |
| 44 | 44 | e1c3a6e2-959a-4071-a802-1dd6801ef79c | startup | 93 | M | Yes | Real Estate | Canada | 2011-07-24 23:22:18.105688 |
| 49 | 49 | 166279ee-da93-4d8e-a65f-28fbf8c46941 | medium | 97 | F | Yes | Travel | SouthWest | 2015-04-06 17:30:12.456716 |
| 61 | 61 | 2ecb87dc-ef9d-4eab-a28a-a4a0eb4ae951 | large | 92 | N | Yes | Tech | NorthEast | 2001-03-15 04:14:13.841541 |
| 73 | 73 | 8467bbf4-6586-4da2-bcbd-8e6ea57be2b1 | startup | 95 | M | No | Fashion | Canada | 2003-07-12 06:29:47.012728 |
| 88 | 88 | 2766d90e-bb83-43e0-926b-ea338284bbb6 | large | 91 | F | Yes | Business | SouthWest | 2005-11-01 12:55:01.462902 |
| 96 | 96 | 9b257b89-9ce1-455e-be96-b165bd240a9c | large | 95 | M | No | Fashion | MidWest | 2005-10-13 16:58:23.404077 |
| 122 | 122 | 0f1cbe51-53cc-46e2-abaa-183d041aedf8 | large | 99 | M | No | Fashion | SouthWest | 2010-09-28 15:24:32.509028 |
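
Filters like this can be combined. Here is a minimal sketch on a made-up table that keeps only the rows where age is over 90 and the user clicked on the ad. The data and names are assumptions for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "age": [92, 95, 30, 97],
    "clicked_on_ad": ["Yes", "No", "Yes", "Yes"],
})

# Wrap each condition in parentheses and join them with & (and) or | (or)
over_90_clicked = toy_df[(toy_df["age"] > 90) & (toy_df["clicked_on_ad"] == "Yes")]
print(over_90_clicked)
```

The parentheses matter: without them, Python evaluates `&` before the comparisons and the filter raises an error.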


### Now, let's imagine that we want to know TKTK:

In [43]:
# Top_Payers = df.nlargest(11,columns=[''])
# Top_Payers
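
The question for this cell is still to be filled in, so purely as an illustration of what `nlargest` does, here is a minimal sketch that pulls the two oldest users from a made-up table. The `age` column choice, the data, and the name `oldest_two` are all assumptions, not the intended analysis.

```python
import pandas as pd

toy_df = pd.DataFrame({"id": ["a", "b", "c", "d"], "age": [34, 99, 58, 72]})

# Keep the 2 rows with the largest values in the 'age' column, largest first
oldest_two = toy_df.nlargest(2, columns="age")
print(oldest_two)
```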

### And, last but not least, a quick bit of data visualization. We'll delve much further into data viz for exploratory analysis later in this course. 

In [44]:
# import pandas as pd
# import numpy as np
# import matplotlib.pyplot as plt
# import seaborn as sns
# %matplotlib inline

In [45]:
# sns.pairplot(df.dropna())

In [46]:
# sns.heatmap(df.corr(),annot=True)