### The Pandas Module
Pandas is a Python module for working with tabular data (i.e., data in a table with rows and columns). Pandas has a lot of the same functionality as SQL or Excel, but adds the power of Python.

In [1]:
import pandas as pd

### Create a DataFrame I
A DataFrame is an object that stores data as rows and columns. You can think of a DataFrame as a spreadsheet or as a SQL table. You can manually create a DataFrame or fill it with data from a CSV, an Excel spreadsheet, or a SQL query.

DataFrames have rows and columns. Each column has a name, which is usually a string. Each row has an index, which by default is an integer starting at 0. DataFrames can contain many different data types: strings, ints, floats, tuples, etc.

You can pass in a dictionary to pd.DataFrame. Each key is a column name and each value is a list of column values. The columns must all be the same length or you will get an error. Here’s an example:

In [2]:
df1 = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
    'address': ['123 Main St.', '456 Maple Ave.', '789 Broadway'],
    'age': [34, 28, 51]
})
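
The length requirement can be checked with a minimal sketch. The values below are made up for illustration, and the sketch assumes a pandas version that reports the mismatch as a ValueError:

```python
import pandas as pd

mismatch_error = None
try:
    # 'name' has two values but 'age' has only one,
    # so the columns cannot be lined up into rows.
    pd.DataFrame({
        'name': ['John Smith', 'Jane Doe'],
        'age': [34]
    })
except ValueError as err:
    mismatch_error = err

# Show the error that pandas raised for the uneven columns.
print(repr(mismatch_error))
```

Padding the shorter list (for example with None) is one way to make the lengths agree when a value is genuinely missing.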

You run an online clothing store called Panda's Wardrobe. You need a DataFrame containing information about your products.

In [3]:
df1 = pd.DataFrame({
  'Product ID': [1, 2, 3, 4],
  'Product Name': ['t-shirt', 't-shirt', 'skirt', 'skirt'],
  'Color': ['blue', 'green', 'red', 'black']
})

print(df1)

   Color  Product ID Product Name
0   blue           1      t-shirt
1  green           2      t-shirt
2    red           3        skirt
3  black           4        skirt


### Create a DataFrame II
You can also add data using lists.

For example, you can pass in a list of lists, where each one represents a row of data. Use the keyword argument columns to pass a list of column names.

In [4]:
df2 = pd.DataFrame([
    ['John Smith', '123 Main St.', 34],
    ['Jane Doe', '456 Maple Ave.', 28],
    ['Joe Schmo', '789 Broadway', 51]
    ],
    columns=['name', 'address', 'age'])

You're running a chain of pita shops called Pita Power. You want to create a DataFrame with information on your different store locations.

In [5]:
df2 = pd.DataFrame([
  [1, 'San Diego', 100],
  [2, 'Los Angeles', 120],
  [3, 'San Francisco', 90],
  [4, 'Sacramento', 115]
],
  columns=['Store ID', 'Location', 'Number of Employees'])

print(df2)

   Store ID       Location  Number of Employees
0         1      San Diego                  100
1         2    Los Angeles                  120
2         3  San Francisco                   90
3         4     Sacramento                  115


### Comma Separated Values (CSV)
We now know how to create our own DataFrame. However, most of the time, we'll be working with datasets that already exist. One of the most common formats for big datasets is the CSV.

CSV (comma separated values) is a text-only spreadsheet format.

### Loading and Saving CSVs
When you have data in a CSV, you can load it into a DataFrame in Pandas using pd.read_csv():

In [6]:
# pd.read_csv('my-csv-file.csv')

In the example above, the pd.read_csv() function is called. The name of the CSV file, my-csv-file.csv, is passed in as an argument.

We can also save data to a CSV, using .to_csv().

In [7]:
# df.to_csv('new-csv-file.csv')

In the example above, the .to_csv() method is called on df (which represents a DataFrame object). The name of the CSV file is passed in as an argument (new-csv-file.csv). By default, this method will save the CSV file in your current directory.
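
As a rough sketch of a full save-and-load round trip, the snippet below writes a small DataFrame to a CSV and reads it back. The file name cities.csv and the data are hypothetical, and a temporary directory is used only so that the sketch leaves no files behind:

```python
import os
import tempfile

import pandas as pd

# Hypothetical data for illustration.
cities = pd.DataFrame({
    'City': ['Maplewood', 'Wayne'],
    'Population': [100000, 350000]
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'cities.csv')  # hypothetical file name
    # index=False keeps the row index out of the saved file.
    cities.to_csv(path, index=False)
    # Read the file back into a new DataFrame.
    loaded = pd.read_csv(path)

print(loaded)
```

Without index=False, .to_csv() also writes the row index as an extra first column, which then shows up as an unnamed column when the file is read back.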

You're working for the County of Whoville and you just received a CSV of data about the different cities in your county. Read the CSV 'sample.csv' into a variable called df, so that you can learn more about the cities.

Let's inspect the CSV.

Type print(df) on the next line and then run your code. What sort of data were you sent?

In [8]:
df = pd.read_csv('sample.csv')
print(df)

            City  Population  Median Age
0      Maplewood      100000          40
1          Wayne      350000          33
2  Forrest Hills      300000          35
3        Paramus      400000          55
4     Hackensack      290000          39


### Inspect a DataFrame
When we load a new DataFrame from a CSV, we want to know what it looks like.

If it's a small DataFrame, you can display it by typing print(df).

If it's a larger DataFrame, it's helpful to be able to inspect a few items without having to look at the entire DataFrame.

The method .head() gives the first 5 rows of a DataFrame. If you want to see more rows, you can pass in the positional argument n. For example, df.head(10) would show the first 10 rows.

The method df.info() prints a summary of the DataFrame: the name of each column, how many non-null values it has, and its data type.
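
Here is a small sketch of both inspection methods. The scores DataFrame is invented for illustration and has eight rows so that .head() has something to trim:

```python
import pandas as pd

# Hypothetical data: eight rows of made-up scores.
scores = pd.DataFrame({
    'player': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
    'score': [10, 20, 30, 40, 50, 60, 70, 80]
})

first_five = scores.head()     # no argument: the first 5 rows
first_three = scores.head(3)   # n=3: the first 3 rows

print(first_three)

# .info() prints its summary directly and returns None,
# so there is no need to wrap it in print().
scores.info()
```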

You're working for a Hollywood studio, trying to use data to predict the next big hit. Load the CSV imdb.csv into a variable called df, so that you can learn about popular movies from the past 90 years.

In [9]:
df = pd.read_csv('imdb.csv')

In [10]:
print(df.head(1))

   id    name   genre  year  imdb_rating
0   1  Avatar  action  2009          7.9


In [11]:
print(df.head())

   id                                       name   genre  year  imdb_rating
0   1                                     Avatar  action  2009          7.9
1   2                             Jurassic World  action  2015          7.3
2   3                               The Avengers  action  2012          8.1
3   4                            The Dark Knight  action  2008          9.0
4   5  Star Wars: Episode I - The Phantom Menace  action  1999          6.6


### Select Columns
Now we know how to create and load data. Let's select parts of those datasets that are interesting or important to our analyses.

Perhaps you want to take the average or plot a histogram of the ages. In order to do either of these tasks, you'd need to select the column.

There are two possible syntaxes for selecting all values from a column.

    Select the column as if you were selecting a value from a dictionary using a key. For a DataFrame called customers with a column named age, we would type customers['age'] to select the ages.
    If the name of a column follows all of the rules for a variable name (doesn't start with a number, doesn't contain spaces or special characters, etc.), then you can select it using the following notation: df.MySecondColumn. For the same customers DataFrame, we would type customers.age.
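
The two syntaxes can be compared side by side. This is a minimal sketch that assumes a small, made-up customers DataFrame:

```python
import pandas as pd

# Hypothetical customer data for illustration.
customers = pd.DataFrame({
    'name': ['John Smith', 'Jane Doe', 'Joe Schmo'],
    'age': [34, 28, 51]
})

ages_by_key = customers['age']   # dictionary-style selection
ages_by_attr = customers.age     # attribute-style selection

# Both forms return the same Series.
print(ages_by_key.equals(ages_by_attr))

# Once selected, the column can be summarized, e.g. averaged.
print(ages_by_key.mean())
```

A column such as 'Product Name' contains a space, so only the bracket syntax works for it.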
    
The DataFrame df represents data collected by four health clinics run by the same organization. Each row represents a month from January through June and shows the number of appointments made at four different clinics.

You want to analyze what's been happening at the North location. Create a variable called clinic_north that contains ONLY the data from the column clinic_north.

In [12]:
df = pd.DataFrame([
  ['January', 100, 100, 23, 100],
  ['February', 51, 45, 145, 45],
  ['March', 81, 96, 65, 96],
  ['April', 80, 80, 54, 180],
  ['May', 51, 54, 54, 154],
  ['June', 112, 109, 79, 129]],
  columns=['month', 'clinic_east',
           'clinic_north', 'clinic_south',
           'clinic_west']
)

clinic_north = df['clinic_north']
print(type(clinic_north))
print(type(df))

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>


### Selecting Multiple Columns
When you have a larger DataFrame, you might want to select just a few columns.

To select two or more columns from a DataFrame, we use a list of the column names. For example, to select the last_name and email columns from a DataFrame called orders, we would use:

In [13]:
# new_df = orders[['last_name', 'email']]
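
A runnable version of the same idea, as a sketch: the orders DataFrame below is a made-up stand-in with placeholder email addresses. It also shows how the result type depends on whether a list or a bare column name is passed:

```python
import pandas as pd

# Hypothetical orders data for illustration.
orders = pd.DataFrame({
    'id': [1, 2],
    'last_name': ['Lindsay', 'Joyce'],
    'email': ['rl@example.com', 'ej@example.com']
})

new_df = orders[['last_name', 'email']]  # list of names: a DataFrame
only_email = orders[['email']]           # one-item list: still a DataFrame
email_series = orders['email']           # bare name: a Series

print(type(new_df))
print(type(only_email))
print(type(email_series))
```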

Now, you want to compare visits to the Northern and Southern clinics.

Create a variable called clinic_north_south that contains ONLY the data from the columns clinic_north and clinic_south.

In [14]:
df = pd.DataFrame([
  ['January', 100, 100, 23, 100],
  ['February', 51, 45, 145, 45],
  ['March', 81, 96, 65, 96],
  ['April', 80, 80, 54, 180],
  ['May', 51, 54, 54, 154],
  ['June', 112, 109, 79, 129]],
  columns=['month', 'clinic_east',
           'clinic_north', 'clinic_south',
           'clinic_west']
)

clinic_north_south = df[ ['clinic_north', 'clinic_south'] ]
print(type(clinic_north_south))

<class 'pandas.core.frame.DataFrame'>


### Select Rows
You're getting ready to staff the clinic for March this year. You want to know how many visits took place in March last year, to help you prepare.

Write a command that will produce a Series made up of the March data from df from all four clinic sites and save it to the variable march.

In [15]:
# Remember that DataFrames are zero-indexed. The first row is accessed using df.iloc[0].

march = df.iloc[2]
print(march)

month           March
clinic_east        81
clinic_north       96
clinic_south       65
clinic_west        96
Name: 2, dtype: object


### Selecting Multiple Rows
You can also select multiple rows from a DataFrame.
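
Row ranges are selected with .iloc and slice notation, which works like Python list slicing: the start position is included and the end position is not. A minimal sketch, using a made-up single-clinic table:

```python
import pandas as pd

# Hypothetical visit counts for one clinic.
clinic = pd.DataFrame({
    'month': ['January', 'February', 'March', 'April', 'May', 'June'],
    'visits': [100, 45, 96, 80, 54, 109]
})

first_two = clinic.iloc[0:2]    # positions 0 and 1
last_three = clinic.iloc[3:6]   # positions 3, 4, and 5
from_fourth = clinic.iloc[3:]   # omit the end to run to the last row

print(last_three)
```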

One of your doctors thinks that there are more clinic visits in the late Spring.

Write a command that will produce a DataFrame made up of the data for April, May, and June from df for all four sites (rows 3 through 5), and save it to april_may_june.

In [16]:
# The end of an .iloc slice is exclusive, so 3:6 selects rows 3, 4, and 5.
april_may_june = df.iloc[3:6]
print(april_may_june)

   month  clinic_east  clinic_north  clinic_south  clinic_west
3  April           80            80            54          180
4    May           51            54            54          154
5   June          112           109            79          129


### Select Rows with Logic I
You can select a subset of a DataFrame by using logical statements:

In [17]:
# df[df.MyColumnName == desired_column_value]
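
It can help to look at the logical statement on its own. In this sketch (with made-up data), the comparison produces a Series of True/False values, one per row, and the DataFrame keeps only the rows marked True:

```python
import pandas as pd

# Hypothetical visit counts for illustration.
clinic = pd.DataFrame({
    'month': ['January', 'February', 'March'],
    'visits': [100, 45, 96]
})

# The comparison is evaluated row by row.
is_january = clinic.month == 'January'
print(is_january)

# Passing the True/False Series back in keeps the matching rows.
january_rows = clinic[is_january]
print(january_rows)
```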

You're going to staff the clinic for January of this year. You want to know how many visits took place in January of last year, to help you prepare.

Create variable january using a logical statement that selects the row of df where the 'month' column is 'January'.

In [18]:
january = df[df.month == 'January']
print(january)

     month  clinic_east  clinic_north  clinic_south  clinic_west
0  January          100           100            23          100


### Select Rows with Logic II
You can also combine multiple logical statements, as long as each statement is in parentheses.
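
A short sketch of combined conditions, using made-up data. It uses | for "or" and also &, which means "and". The parentheses matter because | and & bind more tightly than == in Python, so leaving them out changes how the expression is grouped:

```python
import pandas as pd

# Hypothetical visit counts for illustration.
clinic = pd.DataFrame({
    'month': ['January', 'February', 'March', 'April'],
    'clinic_east': [100, 51, 81, 80]
})

# "or": keep rows where either condition is True.
march_or_april = clinic[(clinic.month == 'March') | (clinic.month == 'April')]

# "and": keep rows where both conditions are True.
busy_not_january = clinic[(clinic.clinic_east > 80) & (clinic.month != 'January')]

print(march_or_april)
print(busy_not_january)
```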

You want to see how the number of clinic visits changed between March and April.

Create the variable march_april, which contains the data from March and April. Do this using two logical statements combined using |, which means "or".

In [19]:
march_april = df[ (df.month == 'March') | (df.month == 'April') ]
print(march_april)

   month  clinic_east  clinic_north  clinic_south  clinic_west
2  March           81            96            65           96
3  April           80            80            54          180


### Select Rows with Logic III
Suppose we want to select the rows where the customer's name is either "Martha Jones", "Rose Tyler" or "Amy Pond".

In [20]:
#df[df.name.isin(['Martha Jones',
#     'Rose Tyler',
#     'Amy Pond'])]
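
As a sketch, here is the same selection on a small made-up customers table (the ages are invented). It also checks that .isin() gives the same rows as chaining three comparisons with |:

```python
import pandas as pd

# Hypothetical customer data for illustration.
customers = pd.DataFrame({
    'name': ['Martha Jones', 'Rose Tyler', 'Donna Noble', 'Amy Pond'],
    'age': [27, 19, 36, 21]
})

wanted = ['Martha Jones', 'Rose Tyler', 'Amy Pond']

# One statement with .isin() ...
with_isin = customers[customers['name'].isin(wanted)]

# ... selects the same rows as three comparisons joined with "or".
with_or = customers[
    (customers['name'] == 'Martha Jones')
    | (customers['name'] == 'Rose Tyler')
    | (customers['name'] == 'Amy Pond')
]

print(with_isin)
print(with_isin.equals(with_or))
```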

Another doctor thinks that you have a lot of clinic visits in the late Winter.

Create the variable january_february_march, containing the data from January, February, and March. Do this using a single logical statement with the .isin() method.

In [21]:
january_february_march = df[df.month.isin(['January', 'February', 'March'])]
print(january_february_march)

      month  clinic_east  clinic_north  clinic_south  clinic_west
0   January          100           100            23          100
1  February           51            45           145           45
2     March           81            96            65           96


### Setting indices
When we select a subset of a DataFrame using logic, we end up with non-consecutive indices. This is inelegant, and because the index labels no longer match the row positions, it is easy to make mistakes when using .iloc.

We can fix this using the method .reset_index().

By default, .reset_index() moves the old indices into a new column called 'index'. Unless you need those values for something special, it's probably better to use the keyword drop=True so that you don't end up with that extra column.
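
A minimal sketch of both behaviors, using made-up visit counts. Selecting with logic leaves gaps in the index, and .reset_index() returns a renumbered copy, with or without the old indices kept as a column:

```python
import pandas as pd

# Hypothetical visit counts for illustration.
clinic = pd.DataFrame({
    'month': ['January', 'February', 'March', 'April', 'May', 'June'],
    'visits': [100, 45, 96, 80, 54, 109]
})

# Rows 0, 2, and 5 pass the filter, so the index has gaps.
subset = clinic[clinic['visits'] > 90]

# Default: old indices are kept in a new 'index' column.
kept_old = subset.reset_index()

# drop=True: old indices are discarded.
dropped_old = subset.reset_index(drop=True)

print(kept_old)
print(dropped_old)
```

Neither call changes subset itself, because inplace=True was not used.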

Examine the code below. Note that df2 is a subset of rows from df.

In [22]:
df2 = df.loc[[1, 3, 5]]
print(df2)

      month  clinic_east  clinic_north  clinic_south  clinic_west
1  February           51            45           145           45
3     April           80            80            54          180
5      June          112           109            79          129


Create a new DataFrame called df3 by resetting the indices on df2 (don't use inplace or drop). Did df2 change after you ran this command?

Reset the indices of df2 by using the keyword arguments inplace=True and drop=True. Did the indices of df2 change? How is df2 different from df3?

In [23]:
df3 = df2.reset_index()
df2.reset_index(inplace=True, drop=True)
print(df3)
print(df2)

   index     month  clinic_east  clinic_north  clinic_south  clinic_west
0      1  February           51            45           145           45
1      3     April           80            80            54          180
2      5      June          112           109            79          129
      month  clinic_east  clinic_north  clinic_south  clinic_west
0  February           51            45           145           45
1     April           80            80            54          180
2      June          112           109            79          129


### Review
You've completed the lesson! You've just learned the basics of working with a single table in Pandas, including:

    Creating a table from scratch
    Loading data from another file
    Selecting certain rows or columns of a table

Let's practice what you've learned.

In this example, you'll be the data analyst for ShoeFly.com, a fictional online shoe store. You've seen this data; now it's your turn to work with it!

Load the data from shoefly.csv into the variable orders.

In [24]:
orders = pd.read_csv('shoefly.csv')

Inspect the first 6 rows of the data (rows 0 through 5).

In [25]:
print(orders[:6])

      id first_name last_name                         email     shoe_type  \
0  54791    Rebecca   Lindsay  RebeccaLindsay57@hotmail.com         clogs   
1  53450      Emily     Joyce        EmilyJoyce25@gmail.com  ballet flats   
2  91987      Joyce    Waller        Joyce.Waller@gmail.com       sandals   
3  14437     Justin  Erickson   Justin.Erickson@outlook.com         clogs   
4  79357     Andrew     Banks              AB4318@gmail.com         boots   
5  52386      Julie     Marsh        JulieMarsh59@gmail.com       sandals   

  shoe_material shoe_color  
0  faux-leather      black  
1  faux-leather       navy  
2        fabric      black  
3  faux-leather        red  
4       leather      brown  
5        fabric      black  


Your marketing department wants to send out an email blast to everyone who ordered shoes!

Select all of the email addresses from the column email and save them to a variable called emails.

In [26]:
emails = orders['email']

Frances Palmer claims that her order was wrong. What did Frances Palmer order?

Use logic to select that row of orders and save it to the variable frances_palmer.

In [27]:
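# Match on both names so that other customers named Frances are not selected.
frances_palmer = orders[(orders.first_name == 'Frances') & (orders.last_name == 'Palmer')]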
print(frances_palmer)

      id first_name last_name                      email shoe_type  \
9  62083    Frances    Palmer  FrancesPalmer50@gmail.com    wedges   

  shoe_material shoe_color  
9       leather      white  


We need some customer reviews for our comfortable shoes. Select all orders for shoe_type: clogs, boots, and ballet flats and save them to the variable comfy_shoes.

In [28]:
comfy_shoes = orders[orders.shoe_type.isin(['clogs', 'boots', 'ballet flats'])]
print(comfy_shoes)

       id first_name   last_name                         email     shoe_type  \
0   54791    Rebecca     Lindsay  RebeccaLindsay57@hotmail.com         clogs   
1   53450      Emily       Joyce        EmilyJoyce25@gmail.com  ballet flats   
3   14437     Justin    Erickson   Justin.Erickson@outlook.com         clogs   
4   79357     Andrew       Banks              AB4318@gmail.com         boots   
6   20487     Thomas      Jensen              TJ5470@gmail.com         clogs   
7   76971     Janice       Hicks        Janice.Hicks@gmail.com         clogs   
8   21586    Gabriel      Porter     GabrielPorter24@gmail.com         clogs   
10  91629    Jessica        Hale       JessicaHale25@gmail.com         clogs   
12  45832      Susan      Dennis       SusanDennis58@gmail.com  ballet flats   
14  73431    Rebecca     Charles     Rebecca.Charles@gmail.com         boots   
16  39888    Vincent  Stephenson            VS4753@outlook.com         boots   
17  35961        Roy     Tillman        