# Pandas - Build Datasets

![pandas](https://pandas.pydata.org/_static/pandas_logo.png)

Pandas is the de facto way to analyze data in Python, especially for business analytics and data science.

Pandas is built on top of numpy and keeps the Excel-like metaphor going: it helps us analyze data in a tabular format, the same rows-and-columns layout that we are used to seeing in Excel.

Almost always, we can think of pandas **dataframes** as:

- rows = observations, rows in a database, etc

- columns = features, attributes, characteristics of the observations
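
The rows-as-observations, columns-as-features picture can be sketched with a tiny dataframe. The names and values below are made up for illustration and are not part of the course data.

```python
import pandas as pd

# hypothetical toy data: each row is one observation (a student),
# each column is one feature of that observation
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Dee"],
    "age": [21, 25, 19, 30],
    "major": ["math", "bio", "cs", "art"],
})

n_rows, n_cols = toy.shape    # (observations, features)
print(n_rows, n_cols)         # 4 3
print(list(toy.columns))      # ['name', 'age', 'major']
```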

![df-overview](https://cdn-images-1.medium.com/max/1600/1*6p6nF4_5XpHgcrYRrLYVAw.png)

![](http://bookdata.readthedocs.io/en/latest/_images/base_01_pandas_5_0.png)

# Setup

In [1]:
# import numpy and pandas -- the convention is to use np and pd
import numpy as np
import pandas as pd

# set the seed for reproducibility
np.random.seed(12345)

# Pandas Series

Let's step back and build on the fact that we have been working with lists and numpy arrays. We continue with the **pandas series**, which we can think of as a single column in Excel.

## Generate from a list

In [2]:
# create a list from a range object with type conversion
my_list = list( range(10) )
my_list

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [3]:
## convert to the pandas series
series_list = pd.Series(my_list)
series_list

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

We can see that the series also has an index.
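
As a small sketch of what that index is, the example below inspects the default index and then supplies custom labels. The values and the labels `a`, `b`, `c` are assumptions chosen only for illustration.

```python
import pandas as pd

# default index: positions 0, 1, 2, ...
s = pd.Series([10, 20, 30])
print(s.index)            # RangeIndex(start=0, stop=3, step=1)

# hypothetical labels, supplied through the index argument
labeled = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(labeled["b"])       # 20
```

The index is what lets us look values up by label instead of by position.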

In [4]:
# confirm the type
type(series_list)

pandas.core.series.Series

In [5]:
# get the average of the series, or the "column" in Excel terms
series_list.mean()

4.5

## Generate from a numpy array

In [6]:
## generate the numpy array
my_array = np.random.randint(1,100, size = 10)
my_array

array([99, 30,  2, 37, 42, 35, 30,  2, 60, 15])

In [7]:
## convert to a pandas series
series_array = pd.Series(my_array)
series_array

0    99
1    30
2     2
3    37
4    42
5    35
6    30
7     2
8    60
9    15
dtype: int64

In [8]:
type(series_array)

pandas.core.series.Series

In [9]:
# the mean
series_array.mean()

35.2

In [20]:
# a quick plot
series_array.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x11923a128>

# Pandas Dataframes

## Manually build

In [11]:
# build from a dictionary
d = {'student': list(range(100)), 'grade': np.random.choice(a = ['A','B','C','D','F'], size=100, p=[.2, .4, .3, .08, .02]) }
d

{'student': [0,
  1,
  2,
  3,
  4,
  5,
  6,
  7,
  8,
  9,
  10,
  11,
  12,
  13,
  14,
  15,
  16,
  17,
  18,
  19,
  20,
  21,
  22,
  23,
  24,
  25,
  26,
  27,
  28,
  29,
  30,
  31,
  32,
  33,
  34,
  35,
  36,
  37,
  38,
  39,
  40,
  41,
  42,
  43,
  44,
  45,
  46,
  47,
  48,
  49,
  50,
  51,
  52,
  53,
  54,
  55,
  56,
  57,
  58,
  59,
  60,
  61,
  62,
  63,
  64,
  65,
  66,
  67,
  68,
  69,
  70,
  71,
  72,
  73,
  74,
  75,
  76,
  77,
  78,
  79,
  80,
  81,
  82,
  83,
  84,
  85,
  86,
  87,
  88,
  89,
  90,
  91,
  92,
  93,
  94,
  95,
  96,
  97,
  98,
  99],
 'grade': array(['C', 'C', 'C', 'D', 'A', 'A', 'B', 'C', 'C', 'C', 'D', 'C', 'C',
        'C', 'B', 'B', 'B', 'C', 'F', 'C', 'C', 'A', 'A', 'C', 'D', 'A',
        'B', 'B', 'B', 'A', 'C', 'C', 'C', 'B', 'C', 'A', 'B', 'B', 'B',
        'B', 'C', 'A', 'B', 'A', 'C', 'D', 'C', 'A', 'B', 'B', 'C', 'D',
        'B', 'B', 'B', 'A', 'A', 'C', 'C', 'B', 'D', 'C', 'C', 'B', 'B',
        'C', 'B', 'B', '

In [12]:
## make the dataframe and look at first few rows
df = pd.DataFrame(d)
df.head()

Unnamed: 0,student,grade
0,0,C
1,1,C
2,2,C
3,3,D
4,4,A


> Just like with dictionaries, pandas builds the dataframe without us having to say anything about column order. For the most part the order does not matter, but we can set it explicitly with the `columns` argument if we so choose.

In [13]:
# make a pandas dataframe
pd.DataFrame(d, columns =['student','grade']).head()


Unnamed: 0,student,grade
0,0,C
1,1,C
2,2,C
3,3,D
4,4,A


## Read files

Pandas can read all sorts of files. Let's look at what we can read by typing `pd.read_` and pressing Tab.

In [14]:
# list every reader that pandas exposes (the names starting with read_)
[name for name in dir(pd) if name.startswith('read_')]

### CSV file from the web

In [15]:
## READ IN A CSV FILE FROM THE WEB

# define the url for our dataset
url = "https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv"

# read the csv from the web into a pandas dataframe
cars = pd.read_csv(url)

# print out the first few rows
cars.head()

Unnamed: 0,model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


### CSV file from local

I am using Google Colab, so I am going to make sure that my dataset is available in my working directory.

The `diamonds.zip` file is in the Session 4 **datasets** folder.

In [16]:
# run a command to list the files in our directory
%ls

04-strings-booleans.ipynb*        10-pandas.ipynb*
05-Tuples-and-Lists.ipynb*        11-pandas-creating-data.ipynb
06-Collab-File-Access.ipynb*      12-pandas-selection.ipynb
07-Loops.ipynb*                   2-example-analysis-project.ipynb
08-Functions.ipynb*               3-intro-to-python.ipynb
08-Numpy.ipynb*                   README.md
09-Numpy.ipynb*                   Untitled.ipynb
09-dictionaries.ipynb*            datasets/
1-getting-started.ipynb           main.py


In [17]:
# define the columns from the file that we want to keep
COLS = ['carat','cut','color', 'clarity', 'depth', 'table', 'price', 'x', 'y', 'z']

# read the diamonds dataset, keeping only the columns listed above
# (assumption: diamonds.zip sits inside the datasets/ folder shown in the listing)
diamonds = pd.read_csv("datasets/diamonds.zip", usecols=COLS)
diamonds.head(n=10)
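
Reading a zipped CSV from a local path can be tried end to end without the course files. The sketch below writes a small stand-in archive to a temporary folder and reads it back with `usecols`. The file name `toy_diamonds.zip` and its contents are hypothetical and are not the real diamonds data.

```python
import os
import tempfile
import zipfile

import pandas as pd

# hypothetical stand-in data; not the real diamonds dataset
csv_text = "carat,cut,price,note\n0.23,Ideal,326,a\n0.21,Premium,326,b\n"

with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "toy_diamonds.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("toy_diamonds.csv", csv_text)

    # checking the path first avoids a FileNotFoundError
    assert os.path.exists(zip_path)

    # compression is inferred from the .zip extension;
    # the archive must contain exactly one file
    toy = pd.read_csv(zip_path, usecols=["carat", "cut", "price"])

print(list(toy.columns))   # ['carat', 'cut', 'price']
print(toy.shape)           # (2, 3)
```

The `note` column is in the file but is dropped, because `usecols` keeps only the columns we name.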

### JSON file on the web

JSON is a file format that looks a lot like a dictionary.

Click on the link below to see what the response from the API looks like

https://www.metaweather.com/api/location/search/?query=boston
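
To see the dictionary-like shape of JSON without a network call, here is a minimal offline sketch. The JSON text and its `city` and `state` keys are made up for illustration and are not the fields returned by the API above.

```python
import io
import json

import pandas as pd

# hypothetical JSON text, shaped like a list of records
raw = '[{"city": "Boston", "state": "MA"}, {"city": "Austin", "state": "TX"}]'

records = json.loads(raw)        # a list of Python dictionaries
print(records[0]["city"])        # Boston

# read_json accepts a file-like object as well as a URL or path
df_json = pd.read_json(io.StringIO(raw))
print(df_json.shape)             # (2, 2)

# and back to dictionaries, one per row
print(df_json.to_dict(orient="records")[1])   # {'city': 'Austin', 'state': 'TX'}
```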



In [0]:
# bring in the data -- note that below we are using read_json, which tells
# python that this is the format
boston = pd.read_json("https://www.metaweather.com/api/location/search/?query=boston")
boston

In [0]:
# if we wanted to, we COULD convert the pandas dataframe to a dictionary
weather = boston.to_dict()
weather

#### Quick exercise

Write a command to extract the longitude from the weather dictionary

> Hint:  remember our exercises from week 1 and how we can look for methods to operate on our data

# Explore the Dataframe

In [0]:
# remember shape?
cars.shape

In [0]:
# top and bottom of the dataset
cars.head()


In [0]:
cars.tail()

In [0]:
# sample records
cars.sample()

In [0]:
# a nice summary of the numerical columns
cars.describe()

In [0]:
# column names
cars.columns

> REMEMBER:  Column names act as an index too!
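
A quick sketch of what that means, using a hypothetical two-column dataframe (not the full cars data): the column names live in a pandas `Index` object, and a column label looks up a column.

```python
import pandas as pd

# hypothetical toy data with two of the cars columns
toy = pd.DataFrame({"mpg": [21.0, 22.8], "cyl": [6, 4]})

# the column names are stored in an Index, just like the row labels
print(isinstance(toy.columns, pd.Index))   # True

# get_loc translates a column label into its position
print(toy.columns.get_loc("cyl"))          # 1

# a column label selects the whole column
print(toy["cyl"].tolist())                 # [6, 4]
```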

In [0]:
# get the row index
cars.index

In [0]:
# remember, if we wanted to, we can set the index
cars.index = cars['model']
cars.index

In [0]:
cars.head()

> The `model` values now appear twice, once as the index and once as a column. We will learn how to handle the duplicate column in a later lesson

In [0]:
# we can also look at the columns to determine if there are missing data
cars.isnull().sum()

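> Note the chaining. We are applying:

- the `isnull()` method, which marks each missing value as `True`
- and then `sum()`, which adds up those flags column by column

> REMEMBER:  Booleans also count as 0/1 for False/True, so each sum is the number of missing values in that column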
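
The two chained steps can be pulled apart to see what each one returns. This is a sketch on a made-up dataframe with one missing value, not on the cars data.

```python
import numpy as np
import pandas as pd

# hypothetical toy data with a single missing mpg value
toy = pd.DataFrame({
    "mpg": [21.0, np.nan, 22.8],
    "hp": [110, 93, 175],
})

mask = toy.isnull()             # step 1: True wherever a value is missing
print(mask["mpg"].tolist())     # [False, True, False]

counts = mask.sum()             # step 2: True counts as 1, so the sum is a count
print(counts["mpg"], counts["hp"])   # 1 0

print(True + True)              # 2
```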

## Exercise

We also have the diamonds dataset loaded.

### Question 1:

Print out a summary of the dataset

### Question 2:
What is the average for carat?

### Question 3:

Your boss asks you to calculate the average price within the dataset.  

Use **two** different approaches to arrive at the answer.

> HINT:  the answer is 3932.799722

In [0]:
# method 1

In [0]:
# method 2