# Data Analysis with Python and Pandas 

---

In [None]:
import pandas as pd # importing the Pandas library

#### For a full list of all the possible Pandas operations:  https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

---

## Data Upload

### The first step of data analysis is to get your data in the right place. In order to load our CSV (Comma-Separated Values) file into our Jupyter Notebook, we need to point our machine to the right folder, so to speak.

### We can use Command Line commands to do so. (For more on the Command Line, check out "Unix 101" in the GitHub repo). 

#### _*Please note that the following bash commands (!pwd and !ls) may not show your local files if you are using Google Colab, because Colab runs on a remote machine. If you are using Google Colab, on the left-hand side of your UI, click "Files" >> "Upload" >> and select your CSV_

In [None]:
!pwd # AKA 'Print Working Directory': tells us which folder we are in right now.

# think of this like using your mouse to click into and out of folders on your desktop. 
# this is just bypassing that UI.

# the '!' allows you to execute a shell command. Basically, you're working as if you would in 
# your terminal, but from the Jupyter Notebook.

In [None]:
!ls # list all of the files in the current directory (remember, directory = folder in UI world).

## Now that I'm in the right place (I see the CSV is in this folder), I can 'read' my CSV using the following command:

In [None]:
df =  # read in the csv

# we are assigning our dataset to a variable named 'df'.
# we can name this variable anything at all; it doesn't matter.
# 'df' is the common convention, though, and stands for 'data frame'.

# you can ignore the 'encoding' piece for now, we'll get to that later on when we talk about web scraping. 
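As a minimal sketch of reading a CSV: the file name in the comment and the sample rows below are hypothetical placeholders, not the workshop's actual dataset. The runnable part reads from an in-memory string so it works without any file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for the real file
csv_text = "name,age,region\nAna,34,NorthEast\nBen,21,SouthEast\n"

# pd.read_csv accepts a file path or any file-like object.
# With a real file it would look something like:
#   df = pd.read_csv("my_data.csv", encoding="utf-8")  # hypothetical file name
df = pd.read_csv(io.StringIO(csv_text))

print(df)
```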

## Let's begin with a primary, exploratory analysis of our data...

In [None]:
pd.options.display.max_rows = 2000 # the way Jupyter Notebook tends to display the results of such queries isn't 
                                   # always helpful, but we can very easily change that.
                                   # this will ensure we can view up to 2,000 rows without seeing ellipses in the UI
    
pd.options.display.max_columns = 50 # try commenting out this last line ('max_columns =50') then run the cell below
                                    # to see the difference this formatting makes 

In [None]:
 # this gets the first five rows of data in your data frame 
          # df.tail() will give you the last five rows
          # if you want, you can choose any number - df.head(15) would give you the first 15 rows, for instance
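A small sketch of `head()` and `tail()`, assuming a toy data frame with a single made-up `age` column rather than the workshop data:

```python
import pandas as pd

# Toy data frame with 20 rows (values are made up for illustration)
df = pd.DataFrame({"age": range(20, 40)})

first_five = df.head()       # first five rows by default
last_five = df.tail()        # last five rows by default
first_fifteen = df.head(15)  # any number of rows can be requested

print(first_five)
```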

In [None]:
 # get a list of all the column names for your data frame

# we'll discuss this later on, but note that a list consists of comma-separated values inside of square brackets
# you can also use "df.columns" if you prefer, which will give you a similar output

In [None]:
 # let's see how the computer is looking at our data; for instance, as a string, integer, et cetera

# please note that the "object" type usually means the column holds strings
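One way to list column names and inspect data types, sketched on a made-up data frame (the column names here are assumptions for illustration). Depending on the pandas version, the text column may be reported as `object` or as a dedicated string type.

```python
import pandas as pd

# Made-up data: one text column, one integer column, one float column
df = pd.DataFrame({
    "name": ["Ana", "Ben"],
    "age": [34, 21],
    "spend": [10.5, 3.0],
})

column_names = list(df)  # df.columns.tolist() gives the same names
column_types = df.dtypes  # one dtype per column

print(column_names)
print(column_types)
```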

In [None]:
# let's drop that "unnamed" column 
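A sketch of how an "Unnamed" column typically arises and how it can be dropped. It assumes the CSV was saved with its index, leaving the first header cell blank; the sample rows are invented.

```python
import io
import pandas as pd

# A blank first header cell is read in by pandas as "Unnamed: 0"
csv_text = ",name,age\n0,Ana,34\n1,Ben,21\n"
df = pd.read_csv(io.StringIO(csv_text))

# Drop the column by name; drop() returns a new data frame
df = df.drop(columns=["Unnamed: 0"])

print(df.columns.tolist())
```

Passing `index_col=0` to `pd.read_csv` is another way to avoid the extra column in the first place.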



In [None]:
 # get the basic statistical metrics for a data frame

In [None]:
 # get a count of the non-NA cells for each column

In [None]:
 # see how many times each value appears in a column (NA cells are excluded)

In [None]:
 # just some basic information on the data types (strings, integers, floats, et cetera) for each column
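The four summaries above could look roughly like this. The data frame is a made-up example with a couple of missing values so that the non-NA counts are visible.

```python
import pandas as pd

# Made-up data with one missing value in each column
df = pd.DataFrame({
    "age": [21, 34, None, 21],
    "region": ["SouthEast", "NorthEast", "SouthEast", None],
})

summary = df.describe()                      # count, mean, std, min, quartiles, max (numeric columns)
non_na = df.count()                          # number of non-NA cells per column
region_counts = df["region"].value_counts()  # occurrences of each value, NA excluded
df.info()                                    # prints dtypes and non-null counts

print(summary)
print(non_na)
print(region_counts)
```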

## It's important to note that our timestamp values are being stored as 'non-null object' values (that is, as strings) and not as timestamps, as we'd like. So, let's change that:
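A minimal sketch of the conversion using `pd.to_datetime`. The column name `timestamp` and the sample values are assumptions; substitute the actual column name from your data.

```python
import pandas as pd

# Timestamps stored as plain strings (made-up values)
df = pd.DataFrame({"timestamp": ["2019-01-05 08:30:00", "2019-02-10 17:45:00"]})

# Convert the string column to a real datetime column
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Datetime columns expose date parts through the .dt accessor
months = df["timestamp"].dt.month

print(df.dtypes)
```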

## A bit more primary exploratory analysis:

In [None]:
 # get a random sample row from the data frame

In [None]:
 # select a single column

In [None]:
 # select multiple columns

# .loc is used for labels/names

In [None]:
 # get information on a single row 

# .iloc is used for position numbers

In [None]:
 # get the value of the 7th column (ad_type) for the 4th row (index 3, since counting starts at 0)

## To reiterate: 'loc' gets rows or columns with particular _labels_ from the index, whereas iloc gets rows or columns at a particular _position_ in the index (aka, it only accepts integers). 
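To make the difference concrete, here is a small sketch on a made-up data frame with letter labels for the rows, so that labels and positions cannot be confused:

```python
import pandas as pd

# Made-up data; the row labels are letters rather than numbers
df = pd.DataFrame(
    {"age": [34, 21, 45], "ad_type": ["Culinary", "Sports", "Travel"]},
    index=["a", "b", "c"],
)

by_label = df.loc["b", "ad_type"]            # row labelled "b", column named "ad_type"
by_position = df.iloc[1, 1]                  # second row, second column (positions start at 0)
two_columns = df.loc[:, ["age", "ad_type"]]  # all rows, two columns selected by name

print(by_label, by_position)
```

Both lookups land on the same cell here; `.loc` finds it by name and `.iloc` by counting.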

In [None]:
 # get the mean of a column

In [None]:
 # sort by age

In [None]:
 # see any rows where age < 21
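The last three operations (mean, sort, filter) can be sketched as follows, assuming a made-up data frame with an `age` column:

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({"name": ["Ana", "Ben", "Cy"], "age": [34, 19, 21]})

mean_age = df["age"].mean()    # average of one column
by_age = df.sort_values("age")  # ascending by default; returns a new data frame
under_21 = df[df["age"] < 21]   # boolean mask keeps only the matching rows

print(mean_age)
print(by_age)
print(under_21)
```

Conditions can be combined with `&` and `|`, with each condition wrapped in parentheses, which is useful for the exercises below.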

## Exercise 1: How many 21 year-olds were served Culinary ads?

## Exercise 2: What is the most common company size in the SouthEast?

---

## Moving on, let's look at a larger data set from https://data.cityofnewyork.us/Health/New-York-City-Leading-Causes-of-Death/jb7j-dtam that details the leading causes of death in NYC:

In [None]:
import requests

url = 'http://data.cityofnewyork.us/api/views/jb7j-dtam/rows.json'
results = requests.get(url).json() # reading in the json just as we did with our citibike info last week

In [None]:
# there are two main fields returned in the json, the meta that just describes the actual metadata, and the 
# data itself



## Let's create a DataFrame from our JSON data.

## Let's also add some column names:

In [None]:
 # this gives us the descriptions and names for the columns


In [None]:
 # we create a list of the column names 


In [None]:
 # now we pass in a list of those column names to our df
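A sketch of the three steps above. It uses a tiny hand-written dictionary in place of the real API response, and it assumes the column descriptions live under `meta` > `view` > `columns`, each with a `name` key; check `results['meta'].keys()` on the real response to confirm. All values are placeholders.

```python
import pandas as pd

# Hypothetical stand-in for requests.get(url).json(); values are placeholders
results = {
    "meta": {"view": {"columns": [
        {"name": "Year"}, {"name": "Leading Cause"}, {"name": "Deaths"},
    ]}},
    "data": [
        [2014, "Cause A", "100"],
        [2014, "Cause B", "50"],
    ],
}

columns = results["meta"]["view"]["columns"]      # descriptions of each column
column_names = [col["name"] for col in columns]   # keep just the names

# Build the data frame from the rows and attach the column names
df = pd.DataFrame(results["data"], columns=column_names)

print(df)
```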


## We can drop a few of these columns too as they're really not too helpful:

---

## Above, the axis='columns' says that we are looking to drop columns. If we had axis='index' we would be dropping rows with the passed IDs, where the ID for a row is that row's index value.

## The inplace=True specifies that we will not be creating a new dataframe; instead, the current one is modified directly so that it has fewer columns.
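A minimal sketch of both forms of `drop`, on a made-up data frame (the `sid` column name is an assumption chosen for illustration):

```python
import pandas as pd

# Made-up data with default integer row labels 0, 1, 2
df = pd.DataFrame({"sid": [1, 2, 3], "Year": [2011, 2014, 2014], "Deaths": [5, 7, 9]})

# Drop a column by name; inplace=True modifies df and returns None
df.drop(["sid"], axis="columns", inplace=True)

# Drop the row whose index label is 0
df.drop([0], axis="index", inplace=True)

print(df)
```

Without `inplace=True`, `drop` returns a new data frame, so the equivalent would be `df = df.drop(...)`.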

In [None]:
# let's drop those first three rows as they appear to be metadata that for some reason were included in our df



---

## It's important to note that we can always rename our columns using a dictionary:
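For example, a sketch with `rename`, where each dictionary key is the old column name and each value is the new one. The column names here are made up for illustration.

```python
import pandas as pd

# Made-up data frame with awkward column names
df = pd.DataFrame({"Leading Cause": ["Cause A"], "Race Ethnicity": ["Group B"]})

# Keys are the current names, values are the replacements
df = df.rename(columns={"Leading Cause": "cause", "Race Ethnicity": "ethnicity"})

print(df.columns.tolist())
```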

## Now, we've spoken a bit about datatypes, and why it's important that our computer is viewing data as we need it to; for instance, a string as a string, an integer as an integer. 

## Remember that 'object' is a string in this case:

## We can pass the `errors` argument to specify what should happen when Pandas cannot convert a value. From the [documentation of to_numeric](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_numeric.html), we get:

* If ‘raise’, then invalid parsing will raise an exception
* If ‘coerce’, then invalid parsing will be set as NaN
* If ‘ignore’, then invalid parsing will return the input
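A short sketch of the 'coerce' and 'raise' options on a made-up series of strings, one of which is not a number:

```python
import pandas as pd

# Numbers stored as strings, plus one value that cannot be parsed
raw = pd.Series(["10", "25", "."])

# 'coerce' turns anything unparseable into NaN
coerced = pd.to_numeric(raw, errors="coerce")

# 'raise' (the default) stops with an exception instead
try:
    pd.to_numeric(raw, errors="raise")
    raised = False
except ValueError:
    raised = True

print(coerced)
print("raised an error:", raised)
```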

## Last but not least, we can also mark some variables as categorical
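One way to do this is `astype('category')`, sketched here on a made-up column. A categorical column stores each distinct value once, which can save memory when a column has few unique values.

```python
import pandas as pd

# Made-up column with only two distinct values
df = pd.DataFrame({"Sex": ["F", "M", "F"]})

# Convert the column to the categorical dtype
df["Sex"] = df["Sex"].astype("category")

categories = sorted(df["Sex"].cat.categories)  # the distinct values

print(df["Sex"].dtype)
print(categories)
```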

## Exercise 3: What was the leading cause of death in 2014?

## Exercise 4: How many different causes of death were recorded in 2011?