# Pandas - Build Datasets

![pandas](https://pandas.pydata.org/_static/pandas_logo.png)

Pandas is the de facto way to analyze data in Python, especially for business analytics and data science.

Pandas is built on top of numpy and keeps the Excel-like metaphor going: it lets us analyze data in a tabular format, the same rows-and-columns layout that we are used to seeing in Excel.

Almost always, we can think of pandas **dataframes** as:

- rows = observations, rows in a database, etc

- columns = features, attributes, characteristics of the observations
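To make the rows/columns picture concrete, here is a minimal sketch. The names and values are made up for illustration; each row is one observation and each column is one attribute.

```python
import pandas as pd

# hypothetical data: three observations (rows) described by two features (columns)
people = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy"],
    "age": [31, 45, 27],
})

n_rows, n_cols = people.shape   # (observations, features)
print(n_rows, n_cols)
print(list(people.columns))
```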

![df-overview](https://cdn-images-1.medium.com/max/1600/1*6p6nF4_5XpHgcrYRrLYVAw.png)

![](http://bookdata.readthedocs.io/en/latest/_images/base_01_pandas_5_0.png)

# Setup

In [25]:
# import pandas -- the convention is to use pd
import numpy as np
import pandas as pd

# set the seed for reproducibility
np.random.seed(12345)

# Pandas Series

We have been working with lists and numpy arrays, so let's build on that with the **pandas series**, which we can think of as a single column within Excel.

## Generate from a list

In [26]:
# create a list from a range object with type conversion
my_list = list( range(10) )
my_list

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [27]:
## convert to the pandas series
series_list = pd.Series(my_list)
series_list

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

We can see that the series also has an index.
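The index does not have to be the default 0, 1, 2, and so on. As a small sketch (the labels `a`, `b`, `c` are made up for illustration), we can supply our own index and look values up by label:

```python
import pandas as pd

# default index: the labels are 0, 1, 2
default_idx = pd.Series([10, 20, 30])

# hypothetical labels used as the index instead
labeled = pd.Series([10, 20, 30], index=["a", "b", "c"])

first_by_default = default_idx[0]   # look up by the default integer label
second_by_label = labeled["b"]      # look up by our own label
print(list(labeled.index), second_by_label)
```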

In [28]:
# confirm the type
type(series_list)

pandas.core.series.Series

In [29]:
# get the average of the series, or the "column" in Excel terms
series_list.mean()

4.5

## Generate from a numpy array

In [30]:
## generate the numpy array
my_array = np.random.randint(1,100, size = 10)
my_array

array([99, 30,  2, 37, 42, 35, 30,  2, 60, 15])

In [31]:
## convert to a pandas series
series_array = pd.Series(my_array)
series_array

0    99
1    30
2     2
3    37
4    42
5    35
6    30
7     2
8    60
9    15
dtype: int64

In [32]:
type(series_array)

pandas.core.series.Series

In [33]:
# the mean
series_array.mean()

35.2

In [48]:
# a quick plot -- note the parentheses; without them we only get the plot accessor object back
series_array.plot()

# Pandas Dataframes

## Manually build

In [11]:
# build from a dictionary
d = {'student': list(range(100)), 'grade': np.random.choice(a = ['A','B','C','D','F'], size=100, p=[.2, .4, .3, .08, .02]) }
d

{'student': [0,
  1,
  2,
  3,
  4,
  5,
  6,
  7,
  8,
  9,
  10,
  11,
  12,
  13,
  14,
  15,
  16,
  17,
  18,
  19,
  20,
  21,
  22,
  23,
  24,
  25,
  26,
  27,
  28,
  29,
  30,
  31,
  32,
  33,
  34,
  35,
  36,
  37,
  38,
  39,
  40,
  41,
  42,
  43,
  44,
  45,
  46,
  47,
  48,
  49,
  50,
  51,
  52,
  53,
  54,
  55,
  56,
  57,
  58,
  59,
  60,
  61,
  62,
  63,
  64,
  65,
  66,
  67,
  68,
  69,
  70,
  71,
  72,
  73,
  74,
  75,
  76,
  77,
  78,
  79,
  80,
  81,
  82,
  83,
  84,
  85,
  86,
  87,
  88,
  89,
  90,
  91,
  92,
  93,
  94,
  95,
  96,
  97,
  98,
  99],
 'grade': array(['C', 'C', 'C', 'D', 'A', 'A', 'B', 'C', 'C', 'C', 'D', 'C', 'C',
        'C', 'B', 'B', 'B', 'C', 'F', 'C', 'C', 'A', 'A', 'C', 'D', 'A',
        'B', 'B', 'B', 'A', 'C', 'C', 'C', 'B', 'C', 'A', 'B', 'B', 'B',
        'B', 'C', 'A', 'B', 'A', 'C', 'D', 'C', 'A', 'B', 'B', 'C', 'D',
        'B', 'B', 'B', 'A', 'A', 'C', 'C', 'B', 'D', 'C', 'C', 'B', 'B',
        'C', 'B', 'B', '

In [35]:
## make the dataframe and look at first few rows
df = pd.DataFrame(d)
df.head()

|   | student | grade |
|---|---------|-------|
| 0 | 0 | C |
| 1 | 1 | C |
| 2 | 2 | C |
| 3 | 3 | D |
| 4 | 4 | A |


> Just like with dictionaries, pandas builds the dataframe without us specifying the column order.  For the most part that does not matter, but we can set the column order explicitly if we so choose.

In [36]:
# make a pandas dataframe
pd.DataFrame(d, columns =['student','grade']).head()


|   | student | grade |
|---|---------|-------|
| 0 | 0 | C |
| 1 | 1 | C |
| 2 | 2 | C |
| 3 | 3 | D |
| 4 | 4 | A |
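If the dataframe already exists, a list of column names can reorder it too. This is a sketch on a small stand-in dataframe, not the 100-student data:

```python
import pandas as pd

# small stand-in for the student/grade data
small = pd.DataFrame({"student": [0, 1, 2], "grade": ["C", "A", "B"]})

# selecting with a list of column names returns a new dataframe in that order
reordered = small[["grade", "student"]]
print(list(reordered.columns))
```

The original `small` keeps its column order; only the returned dataframe is rearranged.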


## Read files

Pandas can read all sorts of files.  Let's look at what we can read by typing `pd.read_` and pressing tab.

In [38]:
# tab completion only works interactively; in code, we can list the readers by filtering dir(pd)
[name for name in dir(pd) if name.startswith('read_')]

### CSV file from the web

In [15]:
## READ IN A CSV FILE FROM THE WEB

# define the url for our dataset
url = "https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv"

# read the csv from the web into a pandas dataframe
cars = pd.read_csv(url)

# print out the first few rows
cars.head()

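|   | model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|-------|-----|-----|------|----|------|----|------|----|----|------|------|
| 0 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| 1 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| 2 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| 3 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| 4 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |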


### CSV file from local

I am using Google Colab, so I am going to make sure that my dataset is available in my working directory.

The `diamonds.zip` file is in the Session 4 **datasets** folder.

In [50]:
# run a command to list the files in our directory
%ls -R

04-strings-booleans.ipynb*        10-pandas.ipynb*
05-Tuples-and-Lists.ipynb*        11-pandas-creating-data.ipynb
06-Collab-File-Access.ipynb*      12-pandas-selection.ipynb
07-Loops.ipynb*                   2-example-analysis-project.ipynb
08-Functions.ipynb*               3-intro-to-python.ipynb
08-Numpy.ipynb*                   README.md
09-Numpy.ipynb*                   Untitled.ipynb
09-dictionaries.ipynb*            datasets/
1-getting-started.ipynb           main.py

./datasets:
assignment1.py  diamonds.csv*   shots.csv


In [52]:
# define the columns from the file that we want to keep
COLS = ['carat','cut','color', 'clarity', 'depth', 'table', 'price', 'x', 'y', 'z']

# replace the diamonds dataset by reading it in again and specifying the columns
diamonds = pd.read_csv("datasets/diamonds.csv", usecols=COLS)
diamonds.head(n=10)

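|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|-------|-----|-------|---------|-------|-------|-------|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
| 5 | 0.24 | Very Good | J | VVS2 | 62.8 | 57.0 | 336 | 3.94 | 3.96 | 2.48 |
| 6 | 0.24 | Very Good | I | VVS1 | 62.3 | 57.0 | 336 | 3.95 | 3.98 | 2.47 |
| 7 | 0.26 | Very Good | H | SI1 | 61.9 | 55.0 | 337 | 4.07 | 4.11 | 2.53 |
| 8 | 0.22 | Fair | E | VS2 | 65.1 | 61.0 | 337 | 3.87 | 3.78 | 2.49 |
| 9 | 0.23 | Very Good | H | VS1 | 59.4 | 61.0 | 338 | 4.0 | 4.05 | 2.39 |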
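To see what `usecols` does without needing a file on disk, here is a small sketch that reads CSV text from memory. The `csv_text` content and its `id` column are made up for illustration.

```python
import io
import pandas as pd

# hypothetical CSV content standing in for a file on disk
csv_text = "id,carat,cut,price\n1,0.23,Ideal,326\n2,0.21,Premium,326\n"

# usecols keeps only the listed columns; 'id' and 'cut' are dropped while reading
subset = pd.read_csv(io.StringIO(csv_text), usecols=["carat", "price"])
print(subset)
```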


### JSON file on the web

JSON is a file format that looks a lot like a dictionary.
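As a quick sketch of that resemblance (the JSON text below is a made-up sample, not a live API response), the same text can become plain Python dictionaries or a pandas dataframe:

```python
import io
import json
import pandas as pd

# hypothetical JSON text: a list holding one record
json_text = '[{"title": "Boston", "location_type": "City", "woeid": 2367105}]'

# the standard library parses it into plain Python lists and dictionaries
records = json.loads(json_text)
print(records[0]["title"])

# pandas parses the same text straight into a dataframe
frame = pd.read_json(io.StringIO(json_text))
print(frame)
```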

Click on the link below to see what the response from the API looks like:

https://www.metaweather.com/api/location/search/?query=boston



In [53]:
# bring in the data -- note that below we are using read_json, which tells
# pandas that the data is in JSON format
boston = pd.read_json("https://www.metaweather.com/api/location/search/?query=boston")
boston

|   | latt_long | location_type | title | woeid |
|---|-----------|---------------|-------|-------|
| 0 | 42.358631,-71.056702 | City | Boston | 2367105 |


In [56]:
# if we wanted to, we COULD convert the pandas dataframe to a dictionary
weather = boston.to_dict()
weather

{'latt_long': {0: '42.358631,-71.056702'},
 'location_type': {0: 'City'},
 'title': {0: 'Boston'},
 'woeid': {0: 2367105}}
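The nested shape above, column first and then row index, is the default for `to_dict`. As a sketch on a stand-in dataframe, passing `orient="records"` gives one dictionary per row instead, which can be easier to read:

```python
import pandas as pd

# stand-in for a one-row dataframe
place = pd.DataFrame({"title": ["Boston"], "woeid": [2367105]})

by_column = place.to_dict()                 # {column: {index: value}}
by_row = place.to_dict(orient="records")    # [{column: value}, ...]

print(by_column["title"][0])
print(by_row[0]["title"])
```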

In [81]:
# split the "lat,long" string on the comma into a list, then take index 1 from that list
x = weather.get('latt_long')[0].split(',')[1]
x

'-71.056702'

#### Quick exercise

Write a command to extract the longitude from the weather dictionary.

> Hint:  remember our exercises from week 1 and how we can look for methods to operate on our data

# Explore the Dataframe

In [82]:
# remember shape?
cars.shape

(32, 12)

In [83]:
# top and bottom of the dataset
cars.head()


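|   | model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|-------|-----|-----|------|----|------|----|------|----|----|------|------|
| 0 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| 1 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| 2 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| 3 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| 4 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |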


In [84]:
cars.tail()

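|    | model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|----|-------|-----|-----|------|----|------|----|------|----|----|------|------|
| 27 | Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.9 | 1 | 1 | 5 | 2 |
| 28 | Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.17 | 14.5 | 0 | 1 | 5 | 4 |
| 29 | Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.77 | 15.5 | 0 | 1 | 5 | 6 |
| 30 | Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.57 | 14.6 | 0 | 1 | 5 | 8 |
| 31 | Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.78 | 18.6 | 1 | 1 | 4 | 2 |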


In [85]:
# sample records
cars.sample()

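|    | model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|----|-------|-----|-----|------|----|------|----|------|----|----|------|------|
| 14 | Cadillac Fleetwood | 10.4 | 8 | 472.0 | 205 | 2.93 | 5.25 | 17.98 | 0 | 0 | 3 | 4 |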
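By default `sample()` draws a single random row, so the row we get changes from run to run. As a sketch on a stand-in dataframe, `n` asks for more rows and `random_state` makes the draw repeatable:

```python
import pandas as pd

# stand-in dataframe with ten rows
numbers = pd.DataFrame({"value": range(10)})

# n sets how many rows to draw; random_state fixes the seed for this draw
draw_a = numbers.sample(n=3, random_state=0)
draw_b = numbers.sample(n=3, random_state=0)

# the same seed gives the same rows both times
print(draw_a.equals(draw_b))
```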


In [86]:
# a nice summary of the numerical columns
cars.describe()

|       | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|-------|-----|-----|------|----|------|----|------|----|----|------|------|
| count | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 | 32.0 |
| mean | 20.090625 | 6.1875 | 230.721875 | 146.6875 | 3.596563 | 3.21725 | 17.84875 | 0.4375 | 0.40625 | 3.6875 | 2.8125 |
| std | 6.026948 | 1.785922 | 123.938694 | 68.562868 | 0.534679 | 0.978457 | 1.786943 | 0.504016 | 0.498991 | 0.737804 | 1.6152 |
| min | 10.4 | 4.0 | 71.1 | 52.0 | 2.76 | 1.513 | 14.5 | 0.0 | 0.0 | 3.0 | 1.0 |
| 25% | 15.425 | 4.0 | 120.825 | 96.5 | 3.08 | 2.58125 | 16.8925 | 0.0 | 0.0 | 3.0 | 2.0 |
| 50% | 19.2 | 6.0 | 196.3 | 123.0 | 3.695 | 3.325 | 17.71 | 0.0 | 0.0 | 4.0 | 2.0 |
| 75% | 22.8 | 8.0 | 326.0 | 180.0 | 3.92 | 3.61 | 18.9 | 1.0 | 1.0 | 4.0 | 4.0 |
| max | 33.9 | 8.0 | 472.0 | 335.0 | 4.93 | 5.424 | 22.9 | 1.0 | 1.0 | 5.0 | 8.0 |
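Notice that the text column `model` is missing from the summary: `describe()` only reports on numerical columns when any are present. As a sketch on made-up data, describing just the text column gives a different set of statistics:

```python
import pandas as pd

# hypothetical mix of a text column and a numeric column
fleet = pd.DataFrame({
    "model": ["Mazda RX4", "Datsun 710", "Mazda RX4"],
    "mpg": [21.0, 22.8, 21.0],
})

numeric_summary = fleet.describe()          # numeric columns only, so 'model' is skipped
text_summary = fleet[["model"]].describe()  # text column: count, unique, top, freq
print(text_summary)
```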


In [0]:
# column names
cars.columns

> REMEMBER:  Column names act as an index too!
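A small sketch of what that means, using a made-up two-row dataframe: the column labels and the row labels are the same kind of object, and both are used for lookups.

```python
import pandas as pd

tiny = pd.DataFrame({"mpg": [21.0, 22.8], "cyl": [6, 4]})

# the column labels and the row labels are both pandas Index objects
print(isinstance(tiny.columns, pd.Index), isinstance(tiny.index, pd.Index))

# a column label looks up a column the way a row label looks up a row
mpg_column = tiny["mpg"]    # by column label
first_row = tiny.loc[0]     # by row label
```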

In [0]:
# get the row index
cars.index

In [0]:
# remember, if we wanted to, we can set the index
cars.index = cars['model']
cars.index

In [0]:
cars.head()

> `model` now appears twice, as the index and as a column.  We will learn how to handle the duplicate column in a later lesson.
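One possible way to avoid the duplicate, offered only as a sketch and not necessarily the approach the later lesson takes, is `set_index`, which moves a column into the index instead of copying it. The dataframe below is a made-up stand-in:

```python
import pandas as pd

# stand-in for the cars data
autos = pd.DataFrame({"model": ["Mazda RX4", "Datsun 710"], "mpg": [21.0, 22.8]})

# set_index returns a new dataframe with 'model' as the index and no longer a column
indexed = autos.set_index("model")
print(list(indexed.columns))
print(indexed.loc["Datsun 710", "mpg"])
```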

In [87]:
# we can also look at the columns to determine if there are missing data
cars.isnull().sum()

model    0
mpg      0
cyl      0
disp     0
hp       0
drat     0
wt       0
qsec     0
vs       0
am       0
gear     0
carb     0
dtype: int64

> Note the chaining.  We are applying:

- the `isnull()` method, which flags each missing value as True
- and then `sum()`, which adds up those flags by column

> REMEMBER:  Boolean Logicals are also 0/1 for False/True
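Since the cars data has no missing values, here is a sketch on made-up data with two gaps, so we can watch the True values get counted:

```python
import numpy as np
import pandas as pd

# hypothetical data with two missing values in 'hp' and none in 'mpg'
gaps = pd.DataFrame({"mpg": [21.0, 22.8, 18.7], "hp": [110.0, np.nan, np.nan]})

flags = gaps.isnull()    # True where a value is missing
missing = flags.sum()    # True counts as 1, so the sum is the number of missing values
print(missing)
print(True + True)       # plain Python treats True as 1 as well
```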

## Exercise

We also have the diamonds dataset loaded.

### Question 1:

Print out a summary of the dataset

In [89]:
diamonds.describe()

|       | carat | depth | table | price | x | y | z |
|-------|-------|-------|-------|-------|---|---|---|
| count | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 |
| mean | 0.79794 | 61.749405 | 57.457184 | 3932.799722 | 5.731157 | 5.734526 | 3.538734 |
| std | 0.474011 | 1.432621 | 2.234491 | 3989.439738 | 1.121761 | 1.142135 | 0.705699 |
| min | 0.2 | 43.0 | 43.0 | 326.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.4 | 61.0 | 56.0 | 950.0 | 4.71 | 4.72 | 2.91 |
| 50% | 0.7 | 61.8 | 57.0 | 2401.0 | 5.7 | 5.71 | 3.53 |
| 75% | 1.04 | 62.5 | 59.0 | 5324.25 | 6.54 | 6.54 | 4.04 |
| max | 5.01 | 79.0 | 95.0 | 18823.0 | 10.74 | 58.9 | 31.8 |


### Question 2:
What is the average for carat?

In [102]:
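# mean and median of carat; the describe() summary reports the mean as 0.79794
avg = [diamonds['carat'].mean(), diamonds['carat'].median()]
avg  # not displayed: a notebook cell only echoes its last expression
# going a step further: the average price per carat, which is the value shown below
c = diamonds['price'] / diamonds['carat']
c.mean()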


4008.3947962312695

### Question 3:

Your boss asks you to calculate the average price within the dataset.  

Use **two** different approaches to arrive at the answer.

> HINT:  the answer is 3932.799722

In [96]:
# method 1
diamonds['price'].mean()

3932.799721913237

In [97]:
# method 2
np.mean(diamonds['price'])

3932.799721913237
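There are more than two routes to the same number. As a sketch on made-up prices (not the diamonds data), here are the two methods above plus two others:

```python
import numpy as np
import pandas as pd

# hypothetical prices, not the diamonds data
prices = pd.DataFrame({"price": [326, 334, 338]})

by_method = prices["price"].mean()                      # Series method
by_numpy = np.mean(prices["price"])                     # numpy function
by_hand = prices["price"].sum() / len(prices)           # total divided by count
from_summary = prices.describe().loc["mean", "price"]   # read it off describe()
print(by_method, by_numpy, by_hand, from_summary)
```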