## Data Manipulation with pandas




# Course Description

pandas is the world's most popular Python library, used for everything from data manipulation to data analysis. In this course, you'll learn how to manipulate DataFrames, as you extract, filter, and transform real-world datasets for analysis. Using pandas you’ll explore all the core data science concepts. Using real-world data, including Walmart sales figures and global temperature time series, you’ll learn how to import, clean, calculate statistics, and create visualizations—using pandas to add to the power of Python!

##  Transforming DataFrames
Free
0%

Let’s master the pandas basics. Learn how to inspect DataFrames and perform fundamental manipulations, including sorting rows, subsetting, and adding new columns.

    Introducing DataFrames    50 xp
    Inspecting a DataFrame    100 xp
    Parts of a DataFrame    100 xp
    Sorting and subsetting    50 xp
    Sorting rows    100 xp
    Subsetting columns    100 xp
    Subsetting rows    100 xp
    Subsetting rows by categorical variables    100 xp
    New columns    50 xp
    Adding new columns    100 xp
    Combo-attack!    100 xp


##  Aggregating DataFrames
0%

In this chapter, you’ll calculate summary statistics on DataFrame columns, and master grouped summary statistics and pivot tables.

    Summary statistics    50 xp
    Mean and median    100 xp
    Summarizing dates    100 xp
    Efficient summaries    100 xp
    Cumulative statistics    100 xp
    Counting    50 xp
    Dropping duplicates    100 xp
    Counting categorical variables    100 xp
    Grouped summary statistics    50 xp
    What percent of sales occurred at each store type?    100 xp
    Calculations with .groupby()    100 xp
    Multiple grouped summaries    100 xp
    Pivot tables    50 xp
    Pivoting on one variable    100 xp
    Fill in missing values and sum values with pivot tables    100 xp


##  Slicing and Indexing DataFrames
0%

Indexes are supercharged row and column names. Learn how they can be combined with slicing for powerful DataFrame subsetting.

    Explicit indexes    50 xp
    Setting and removing indexes    100 xp
    Subsetting with .loc[]    100 xp
    Setting multi-level indexes    100 xp
    Sorting by index values    100 xp
    Slicing and subsetting with .loc and .iloc    50 xp
    Slicing index values    100 xp
    Slicing in both directions    100 xp
    Slicing time series    100 xp
    Subsetting by row/column number    100 xp
    Working with pivot tables    50 xp
    Pivot temperature by city and year    100 xp
    Subsetting pivot tables    100 xp
    Calculating on a pivot table    100 xp


##  Creating and Visualizing DataFrames
0%

Learn to visualize the contents of your DataFrames, handle missing data values, and import data from and export data to CSV files.

    Visualizing your data    50 xp
    Which avocado size is most popular?    100 xp
    Changes in sales over time    100 xp
    Avocado supply and demand    100 xp
    Price of conventional vs. organic avocados    100 xp
    Missing values    50 xp
    Finding missing values    100 xp
    Removing missing values    100 xp
    Replacing missing values    100 xp
    Creating DataFrames    50 xp
    List of dictionaries    100 xp
    Dictionary of lists    100 xp
    Reading and writing CSVs    50 xp
    CSV to DataFrame    100 xp
    DataFrame to CSV    100 xp
    Wrap-up    50 xp 
    

## Introducing DataFrames



# *******************************************************************************************************************
**Pandas is a Python package for data manipulation.  It can also be used for data visulation, which will be introduced in chapter4.  We'll start be talking about DataFrames, which form the core of Pandas.  In chapter2, we'll discuss aggregating data to gather insights.  In chapter3, we'll learn all about slicing and indexing to subset DataFrames.  Finally, we'll visualize our data, deal with missing data, and read data into a DataFrame.  


# *******************************************************************************************************************
Pandas is built on top of two essential Python packages (I thought just based on NumPy): NumPy and Matplotlib.  NumPy provides multidimensional array objects for easy data manipulation that Pandas uses to store data, and Matplotlib has powerful data visualization capabilities that Pandas takes advantage of.  

# Almost the entire Python data science community using Pandas.  There are several ways to store data for analysis, but rectangular data, sometimes called tabular data is the most cmmon form.  

In below example with dogs, each observation or each dog is a row, and each variable or each dog property is a column.  Pandas is designed to work with rectangular data like this.  In Pandas, rectangular data is represented as a DataFrame object.  Each programming language used for data analysis has something similar to this.  R also has DataFrames, while SQL has database tables.  Every value within a column has the same data type, either text or numeric, but different columns can contain different data types.  


**When you first receive a new dataset, you want to quickly explore it and get a sense of its contents. 
# *******************************************************************************************************************
Pandas provides several methods for this, the first is .head(), which returns the first few rows of the DataFrame.  With this method, we only had several rows to begain with, not super exciting, but it bocomes very useful if you have many rows.  
The .info() method displays the names of columns, the data types they contain, and whether they have any missing values.  
A DataFrame's .shape attribute contains a tuple that holds the number of rows followed by the number of columns.  Since this is an attribute instead of a method, you write it without parentheses.  
The .describe() method computes some summary statistics for numerical columns, like mean and median.  the count is the number of non-missing values in each column.  Describe is good for a quick overview of numeric variables, but if you want more control, you'll see how to perform more specific calculations later in the course.  

# *******************************************************************************************************************
# DataFrame consist of 3 different components, accessible using attributes.  .values, .columns, .index
The .values attribute as you might expect, contains the data values in a 2-dimensional NumPy array.  The other two components of a DataFrame are labels for columns and rows.  The .columns attributes contains column names, and the .index attribute contains row numbers or row names.  Be careful, since row labels are stored in .index not in .rows.  Notice that these are index objects, which we'll cover in chapter3.  This allows for flexibility in labels.  For example, the dogs data uses row numbers, but row names are also possible.  


Python has a semi-official philosophy on how to write good code called The Zen of Python.  One suggestion is that given aprogramming problem, there should be one obvious solution.  As you go through this course, bear in mind that Pandas deliberately doesnt follow this philosopyh.  Instead there are often multiple ways to solve a problem, leaving you to choose the best.  

In this respect, Pandas is like a Swiss Army Knife, giving you a variety of tools, making it increibly powerful, but more difficult to learn.  

In this course, we aim for a more streamlined approach to Pandas, only covering the most important ways of doing things.  


---------------------------------------------------------------------------------
Name      | Breed       | Color   | Height(cm)  | Weight(kg)  | Dataof Birth
Bella     | Labrador    | Brown   | 56          | 25          | 2013-07-01
Charlie   | Poodle      | Black   | 43          | 23          | 2016-09-16
Lucy      | Chow Chow   | Brown   | 46          | 22          | 2014-08-25
Cooper    | Schnauzer   | Gray    | 49          | 17          | 2011-12-11
Max       | Labrador    | Black   | 59          | 29          | 2017-01-20
Stella    | Chihuahua   | Tan     | 18          | 2           | 2015-04-20
Bernie    | St.Bernard  | White   | 77          | 74          | 2018-02-27




In [9]:
import pandas as pd


df = pd.read_csv('dogs.csv')

print(df)
print()


print(df.info())
print()


print(df.describe())
print()


print(df.columns)
print(df.index)

      Name       Breed  Color  Height(cm)  Weight(kg) Dataof Birth
0    Bella    Labrador  Brown          56          25   2013-07-01
1  Charlie      Poodle  Black          43          23   2016-09-16
2     Lucy   Chow Chow  Brown          46          22   2014-08-25
3   Cooper   Schnauzer   Gray          49          17   2011-12-11
4      Max    Labrador  Black          59          29   2017-01-20
5   Stella   Chihuahua    Tan          18           2   2015-04-20
6   Bernie  St.Bernard  White          77          74   2018-02-27

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Name          7 non-null      object
 1   Breed         7 non-null      object
 2   Color         7 non-null      object
 3   Height(cm)    7 non-null      int64 
 4   Weight(kg)    7 non-null      int64 
 5   Dataof Birth  7 non-null      object
dtypes: int64(2), object(4)

## Inspecting a DataFrame

# *******************************************************************************************************************
# When you get a new DataFrame to work with, the first thing you need to do is explore it and see what it contains. There are several useful methods and attributes for this.

    .head() returns the first few rows (the “head” of the DataFrame).
    .info() shows information on each of the columns, such as the data type and number of missing values.
    .shape returns the number of rows and columns of the DataFrame.
    .describe() calculates a few summary statistics for each column.

homelessness is a DataFrame containing estimates of homelessness in each U.S. state in 2018. The individual column is the number of homeless individuals not part of a family with children. The family_members column is the number of homeless individuals part of a family with children. The state_pop column is the state's total population.

pandas is imported for you.
Instructions 1/4
25 XP

    Question 1
    Print the head of the homelessness DataFrame.
    
    
    
    Question 2
    Print information about the column types and missing values in homelessness.
    
    
    
    Question 3
    Print the number of rows and columns in homelessness.
    
    
    
    Question 4
    
    Print some summary statistics that describe the homelessness DataFrame.


In [13]:
import pandas as pd


df = pd.read_csv('homelessness.csv')

print(df.head())
print(df.info())
print(df.shape)
print(df.describe())

             region       state  individuals  family_members  state_pop
0  EastSouthCentral     Alabama       2570.0           864.0    4887681
1           Pacific      Alaska       1434.0           582.0     735139
2          Mountain     Arizona       7259.0          2606.0    7158024
3  WestSouthCentral    Arkansas       2280.0           432.0    3009733
4           Pacific  California     109008.0         20964.0   39461588
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   region          51 non-null     object 
 1   state           51 non-null     object 
 2   individuals     51 non-null     float64
 3   family_members  51 non-null     float64
 4   state_pop       51 non-null     int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 2.1+ KB
None
(51, 5)
         individuals  family_members     state_pop
count      51.000000     

## Parts of a DataFrame

# *******************************************************************************************************************
# To better understand DataFrame objects, it's useful to know that they consist of three components, stored as attributes:

    .values: A two-dimensional NumPy array of values.
    .columns: An index of columns: the column names.
    .index: An index for the rows: either row numbers or row names.

You can usually think of indexes as a list of strings or numbers, though the pandas Index data type allows for more sophisticated options. (These will be covered later in the course.)

homelessness is available.
Instructions
100 XP

    Import pandas using the alias pd.
    Print a 2D NumPy array of the values in homelessness.
    Print the column names of homelessness.
    Print the index of homelessness.


In [30]:
df = pd.read_csv('homelessness.csv')


print(df.values)
print(type(df.values))

print(df.columns)
print(df.index)

[['EastSouthCentral' 'Alabama' 2570.0 864.0 4887681]
 ['Pacific' 'Alaska' 1434.0 582.0 735139]
 ['Mountain' 'Arizona' 7259.0 2606.0 7158024]
 ['WestSouthCentral' 'Arkansas' 2280.0 432.0 3009733]
 ['Pacific' 'California' 109008.0 20964.0 39461588]
 ['Mountain' 'Colorado' 7607.0 3250.0 5691287]
 ['NewEngland' 'Connecticut' 2280.0 1696.0 3571520]
 ['SouthAtlantic' 'Delaware' 708.0 374.0 965479]
 ['SouthAtlantic' 'DistrictofColumbia' 3770.0 3134.0 701547]
 ['SouthAtlantic' 'Florida' 21443.0 9587.0 21244317]
 ['SouthAtlantic' 'Georgia' 6943.0 2556.0 10511131]
 ['Pacific' 'Hawaii' 4131.0 2399.0 1420593]
 ['Mountain' 'Idaho' 1297.0 715.0 1750536]
 ['EastNorthCentral' 'Illinois' 6752.0 3891.0 12723071]
 ['EastNorthCentral' 'Indiana' 3776.0 1482.0 6695497]
 ['WestNorthCentral' 'Iowa' 1711.0 1038.0 3148618]
 ['WestNorthCentral' 'Kansas' 1443.0 773.0 2911359]
 ['EastSouthCentral' 'Kentucky' 2735.0 953.0 4461153]
 ['WestSouthCentral' 'Louisiana' 2540.0 519.0 4659690]
 ['NewEngland' 'Maine' 1450.0 

## Sorting and subsetting



# *******************************************************************************************************************
**In this video, we'll cover 2 simple stand possibly most important ways to find interesting parts of your DataFrame.  The first thing you can do is change the order of rows by sorting them so that the most interesting data is at the top of the DataFrame.  



**You can sort rows using the .sort_values() method, passing in a column name that you want to sort by.  Setting the ascending=False argument to False will sort the data the other way around.
# *******************************************************************************************************************
We can also sort by multiple variables by passing a list of column names to .sort_values() method.  To change the direction values are sorted in, pass a list to the ascending argument to specify which direction sorting should be done for each variable.  


In [111]:
df = pd.read_csv('homelessness.csv')


print(df.sort_values('state_pop').head())

print(df.sort_values('family_members', ascending=False).head())

              region               state  individuals  family_members  \
50          Mountain             Wyoming        434.0           205.0   
45        NewEngland             Vermont        780.0           511.0   
8      SouthAtlantic  DistrictofColumbia       3770.0          3134.0   
1            Pacific              Alaska       1434.0           582.0   
34  WestNorthCentral         NorthDakota        467.0            75.0   

    state_pop  
50     577601  
45     624358  
8      701547  
1      735139  
34     758080  
              region          state  individuals  family_members  state_pop
32      Mid-Atlantic        NewYork      39827.0         52070.0   19530351
4            Pacific     California     109008.0         20964.0   39461588
21        NewEngland  Massachusetts       6811.0         13257.0    6882635
9      SouthAtlantic        Florida      21443.0          9587.0   21244317
43  WestSouthCentral          Texas      19199.0          6111.0   28628666


In [112]:
df = pd.read_csv('dogs2.csv')


print(df.head())

#####################################################################################################################
print(df.sort_values(['weight_km', 'height_cn'], ascending=[False, True]))

      name      breed  color  height_cn  weight_km data_of_birth
0    Bella   Labrador  Brown         56         24    2013-07-01
1  Charlie     Poodle  Black         43         24    2016-09-16
2     Lucy  Chow Chow  Brown         46         24    2014-08-25
3   Cooper  Schnauzer   Gray         49         17    2011-12-11
4      Max   Labrador  Black         59         29    2017-01-20
      name       breed  color  height_cn  weight_km data_of_birth
6   Bernie  St.Bernard  White         77         74    2018-02-27
4      Max    Labrador  Black         59         29    2017-01-20
1  Charlie      Poodle  Black         43         24    2016-09-16
2     Lucy   Chow Chow  Brown         46         24    2014-08-25
0    Bella    Labrador  Brown         56         24    2013-07-01
3   Cooper   Schnauzer   Gray         49         17    2011-12-11
5   Stella   Chihuahua    Tan         18          2    2015-04-20



# We may want to zoom in on just one column.  We can do this using the name of the DataFrame, followed by square brackets with a column name inside.  


To select multiple columns you need 2 pairs of square brackets.  In this condition, the inner and the outer square brackets are performing different tasks.  

* The outer square brackets are responsible for subsetting the DataFrame, and the inner square brackets are creating a list of column names to subset.  This means you could provide a separate list of column names as a variable and then use that list to perform the same subsetting.  Usually, its easier to do in one line.  



In [40]:
print(df['name'])


print(df[['name', 'breed', 'color']])


#####################################################################################################################
what_i_need = ['name', 'breed', 'color', 'height_cn']

print(df[what_i_need])

0      Bella
1    Charlie
2       Lucy
3     Cooper
4        Max
5     Stella
6     Bernie
Name: name, dtype: object
      name       breed  color
0    Bella    Labrador  Brown
1  Charlie      Poodle  Black
2     Lucy   Chow Chow  Brown
3   Cooper   Schnauzer   Gray
4      Max    Labrador  Black
5   Stella   Chihuahua    Tan
6   Bernie  St.Bernard  White
      name       breed  color  height_cn
0    Bella    Labrador  Brown         56
1  Charlie      Poodle  Black         43
2     Lucy   Chow Chow  Brown         46
3   Cooper   Schnauzer   Gray         49
4      Max    Labrador  Black         59
5   Stella   Chihuahua    Tan         18
6   Bernie  St.Bernard  White         77


# *******************************************************************************************************************
# There are lots of different ways to subset rows, the most common way to do this is by creating a logical condition to filter against.  
For example, lets find all the dogs whose height is greater than 50 centimeters.  Now we have a True or False value for every row.  We can use logical condition inside of square brackets to subset the rows we are interested in to get all of the dogs taller than 50 centimeters.  

**We can also subset rows based on text data.  Here we use double equal sign in the logical condition to filter the dogs that are Labradors.  

**We can also subset based on dates.  Here we filter all the dogs born before 2015.  Notice thdates are in quotes and are written as year-month-day, this is international standard date format.  


# *******************************************************************************************************************
**To subset the rows that meet multiple conditions, you can combine contitions using logical operators, such as and operator seen here.  This means that only rows that meet both of these conditions will be subsetted.  You can also do this in one line of code, but you'll also need to add parentheses around each condition.  



# *******************************************************************************************************************
# *******************************************************************************************************************
**If you want to filter on multiple values of a categorical variable, the easiest way is to use .isin() method.  This method takes in a list of values to filter for.  


In [42]:
above_50 = df['height_cn']>50

print(above_50)


print(df[above_50])

0     True
1    False
2    False
3    False
4     True
5    False
6     True
Name: height_cn, dtype: bool
     name       breed  color  height_cn  weight_km data_of_birth
0   Bella    Labrador  Brown         56         24    2013-07-01
4     Max    Labrador  Black         59         29    2017-01-20
6  Bernie  St.Bernard  White         77         74    2018-02-27


In [43]:
labrador = df['breed'] == 'Labrador'


print(df[labrador])

    name     breed  color  height_cn  weight_km data_of_birth
0  Bella  Labrador  Brown         56         24    2013-07-01
4    Max  Labrador  Black         59         29    2017-01-20


In [50]:
print(type(df['data_of_birth'][0]))

<class 'str'>


# *******************************************************************************************************************
Important question, can we change the column data types? like here from string to data? or datetime


In [53]:
after_2015 = df['data_of_birth'] > '2015'


print(df[before_2015])

      name       breed  color  height_cn  weight_km data_of_birth
1  Charlie      Poodle  Black         43         24    2016-09-16
4      Max    Labrador  Black         59         29    2017-01-20
5   Stella   Chihuahua    Tan         18          2    2015-04-20
6   Bernie  St.Bernard  White         77         74    2018-02-27


In [83]:
after_2015 = df['data_of_birth'] > '2015'
black = df['color'] == 'Black'


                                  ###################################################################################
print(df[after_2015 & black])     # Why we cant replace & with and ???
# This is because ‘and’ tests whether both expressions are logically True 
# while ‘&’ performs bitwise AND operation on the result of both statements.

print()
print(df[after_2015 + black])


print()
print(df[after_2015 | black])

print()
#####################################################################################################################
print(df[(df['data_of_birth']>'2015') & (df['color']=='Black')])

      name       breed  color  height_cn  weight_km data_of_birth
1  Charlie      Poodle  Black         43         24    2016-09-16
4      Max    Labrador  Black         59         29    2017-01-20
5   Stella   Chihuahua  Black         18          2    2015-04-20
6   Bernie  St.Bernard  Black         77         74    2018-02-27

      name       breed  color  height_cn  weight_km data_of_birth
0    Bella    Labrador  Black         56         24    2013-07-01
1  Charlie      Poodle  Black         43         24    2016-09-16
2     Lucy   Chow Chow  Black         46         24    2014-08-25
3   Cooper   Schnauzer  Black         49         17    2011-12-11
4      Max    Labrador  Black         59         29    2017-01-20
5   Stella   Chihuahua  Black         18          2    2015-04-20
6   Bernie  St.Bernard  Black         77         74    2018-02-27

      name       breed  color  height_cn  weight_km data_of_birth
0    Bella    Labrador  Black         56         24    2013-07-01
1  Charl

In [115]:
black_brown = df['color'].isin(['Black', 'Brown'])
#####################################################################################################################


print(df[black_brown])


black_brown2 = df[(df['color']=='Black') | (df['color']=='Brown')]   # Using or operator
print(black_brown2)

      name      breed  color  height_cn  weight_km data_of_birth
0    Bella   Labrador  Brown         56         24    2013-07-01
1  Charlie     Poodle  Black         43         24    2016-09-16
2     Lucy  Chow Chow  Brown         46         24    2014-08-25
4      Max   Labrador  Black         59         29    2017-01-20
      name      breed  color  height_cn  weight_km data_of_birth
0    Bella   Labrador  Brown         56         24    2013-07-01
1  Charlie     Poodle  Black         43         24    2016-09-16
2     Lucy  Chow Chow  Brown         46         24    2014-08-25
4      Max   Labrador  Black         59         29    2017-01-20


## Sorting rows

# Finding interesting bits of data in a DataFrame is often easier if you change the order of the rows. You can sort the rows by passing a column name to .sort_values().

In cases where rows have the same value (this is common if you sort on a categorical variable), you may wish to break the ties by sorting on another column. You can sort on multiple columns in this way by passing a list of column names.
Sort on … 	Syntax
one column 	df.sort_values("breed")
multiple columns 	df.sort_values(["breed", "weight_kg"])

By combining .sort_values() with .head(), you can answer questions in the form, "What are the top cases where…?".

homelessness is available and pandas is loaded as pd.
Instructions 1/3
35 XP

    Question 1
    Sort homelessness by the number of homeless individuals, from smallest to largest, and save this as homelessness_ind.
    Print the head of the sorted DataFrame.
    
    
    
    Question 2
    Sort homelessness by the number of homeless family_members in descending order, and save this as homelessness_fam.
    Print the head of the sorted DataFrame.
    
    
    
    Question 3
#    Sort homelessness first by region (ascending), and then by number of family members (descending). Save this as homelessness_reg_fam.
    Print the head of the sorted DataFrame.


In [99]:
import pandas as pd


homelessness = pd.read_csv('homelessness.csv')

# Sort homelessness by individual
homelessness_ind = homelessness.sort_values('individuals')  #.ascending=False

# Print the top few rows
print(homelessness_ind.head())

              region        state  individuals  family_members  state_pop
50          Mountain      Wyoming        434.0           205.0     577601
34  WestNorthCentral  NorthDakota        467.0            75.0     758080
7      SouthAtlantic     Delaware        708.0           374.0     965479
39        NewEngland  RhodeIsland        747.0           354.0    1058287
45        NewEngland      Vermont        780.0           511.0     624358


In [100]:
# Sort homelessness by descending family members
homelessness_fam = homelessness.sort_values('family_members', ascending=False)

# Print the top few rows
print(homelessness_fam.head())

              region          state  individuals  family_members  state_pop
32      Mid-Atlantic        NewYork      39827.0         52070.0   19530351
4            Pacific     California     109008.0         20964.0   39461588
21        NewEngland  Massachusetts       6811.0         13257.0    6882635
9      SouthAtlantic        Florida      21443.0          9587.0   21244317
43  WestSouthCentral          Texas      19199.0          6111.0   28628666


In [107]:
# Sort homelessness by region, then descending family members
homelessness_reg_fam = homelessness.sort_values(['region', 'family_members'], ascending=[True, False])
#####################################################################################################################


# Print the top few rows
print(homelessness_reg_fam.head())

              region      state  individuals  family_members  state_pop
13  EastNorthCentral   Illinois       6752.0          3891.0   12723071
35  EastNorthCentral       Ohio       6929.0          3320.0   11676341
22  EastNorthCentral   Michigan       5209.0          3142.0    9984072
49  EastNorthCentral  Wisconsin       2740.0          2167.0    5807406
14  EastNorthCentral    Indiana       3776.0          1482.0    6695497


## Subsetting columns

When working with data, you may not need all of the variables in your dataset. Square brackets ([]) can be used to select only the columns that matter to you in an order that makes sense to you. To select only "col_a" of the DataFrame df, use

df["col_a"]

To select "col_a" and "col_b" of df, use

df[["col_a", "col_b"]]

homelessness is available and pandas is loaded as pd.
Instructions 1/3
35 XP

    Question 1
    Create a DataFrame called individuals that contains only the individuals column of homelessness.
    Print the head of the result.
    
    
    
    Question 2
    Create a DataFrame called state_fam that contains only the state and family_members columns of homelessness, in that order.
    Print the head of the result.
    
    
    
    Question 3
    Create a DataFrame called ind_state that contains the individuals and state columns of homelessness, in that order.
    Print the head of the result.


In [108]:
# Select the individuals column
individuals = homelessness['individuals']

# Print the head of the result
print(individuals.head())

0      2570.0
1      1434.0
2      7259.0
3      2280.0
4    109008.0
Name: individuals, dtype: float64


In [109]:
# Select the state and family_members columns
state_fam = homelessness[['state', 'family_members']]

# Print the head of the result
print(state_fam.head())

        state  family_members
0     Alabama           864.0
1      Alaska           582.0
2     Arizona          2606.0
3    Arkansas           432.0
4  California         20964.0


In [110]:
# Select only the individuals and state columns, in that order
ind_state = homelessness[['individuals', 'state']]

# Print the head of the result
print(ind_state.head())

   individuals       state
0       2570.0     Alabama
1       1434.0      Alaska
2       7259.0     Arizona
3       2280.0    Arkansas
4     109008.0  California


## Subsetting rows

A large part of data science is about finding which bits of your dataset are interesting. One of the simplest techniques for this is to find a subset of rows that match some criteria. This is sometimes known as filtering rows or selecting rows.

There are many ways to subset a DataFrame, perhaps the most common is to use relational operators to return True or False for each row, then pass that inside square brackets.

dogs[dogs["height_cm"] > 60]
dogs[dogs["color"] == "tan"]

You can filter for multiple conditions at once by using the "bitwise and" operator, &.

dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]

homelessness is available and pandas is loaded as pd.
Instructions 1/3
35 XP

    Question 1
    Filter homelessness for cases where the number of individuals is greater than ten thousand, assigning to ind_gt_10k. View the printed result.
    
    
    
    Question 2
    Filter homelessness for cases where the USA Census region is "Mountain", assigning to mountain_reg. View the printed result.
    
    
    
    Question 3
    Filter homelessness for cases where the number of family_members is less than one thousand and the region is "Pacific", assigning to fam_lt_1k_pac. View the printed result.
    

In [117]:
# Filter for rows where individuals is greater than 10000
ind_gt_10k = homelessness[homelessness['individuals']>10000]

# See the result
print(ind_gt_10k)

              region       state  individuals  family_members  state_pop
4            Pacific  California     109008.0         20964.0   39461588
9      SouthAtlantic     Florida      21443.0          9587.0   21244317
32      Mid-Atlantic     NewYork      39827.0         52070.0   19530351
37           Pacific      Oregon      11139.0          3337.0    4181886
43  WestSouthCentral       Texas      19199.0          6111.0   28628666
47           Pacific  Washington      16424.0          5880.0    7523869


In [118]:
# Filter for rows where region is Mountain
mountain_reg = homelessness[homelessness['region']=='Mountain']

# See the result
print(mountain_reg.head())

      region     state  individuals  family_members  state_pop
2   Mountain   Arizona       7259.0          2606.0    7158024
5   Mountain  Colorado       7607.0          3250.0    5691287
12  Mountain     Idaho       1297.0           715.0    1750536
26  Mountain   Montana        983.0           422.0    1060665
28  Mountain    Nevada       7058.0           486.0    3027341


In [119]:
# Filter for rows where family_members is less than 1000 
# and region is Pacific
fam_lt_1k_pac = homelessness[(homelessness['family_members']<1000) & (homelessness['region']=='Pacific')]

# See the result
print(fam_lt_1k_pac)

    region   state  individuals  family_members  state_pop
1  Pacific  Alaska       1434.0           582.0     735139


## Subsetting rows by categorical variables

Subsetting data based on a categorical variable often involves using the "or" operator (|) to select rows from multiple categories. This can get tedious when you want all states in one of three different regions, for example. Instead, use the .isin() method, which will allow you to tackle this problem by writing one condition instead of three separate ones.

colors = ["brown", "black", "tan"]
condition = dogs["color"].isin(colors)
dogs[condition]

homelessness is available and pandas is loaded as pd.
Instructions 1/2
50 XP

    Question 1
    Filter homelessness for cases where the USA census region is "South Atlantic" or it is "Mid-Atlantic", assigning to south_mid_atlantic. View the printed result.
    
    
    
    Question 2
    Filter homelessness for cases where the USA census state is in the list of Mojave states, canu, assigning to mojave_homelessness. View the printed result.
    

In [122]:
#print(homelessness)


# Subset for rows in South Atlantic or Mid-Atlantic regions
south_mid_atlantic = homelessness[homelessness['region'].isin(['SouthAtlantic', 'Mid-Atlantic'])]

# See the result
print(south_mid_atlantic)

           region               state  individuals  family_members  state_pop
7   SouthAtlantic            Delaware        708.0           374.0     965479
8   SouthAtlantic  DistrictofColumbia       3770.0          3134.0     701547
9   SouthAtlantic             Florida      21443.0          9587.0   21244317
10  SouthAtlantic             Georgia       6943.0          2556.0   10511131
20  SouthAtlantic            Maryland       4914.0          2230.0    6035802
30   Mid-Atlantic           NewJersey       6048.0          3350.0    8886025
32   Mid-Atlantic             NewYork      39827.0         52070.0   19530351
33  SouthAtlantic       NorthCarolina       6451.0          2817.0   10381615
38   Mid-Atlantic        Pennsylvania       8163.0          5349.0   12800922
40  SouthAtlantic       SouthCarolina       3082.0           851.0    5084156
46  SouthAtlantic            Virginia       3928.0          2047.0    8501286
48  SouthAtlantic        WestVirginia       1021.0           222

In [123]:
# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]

# Filter for rows in the Mojave Desert states
mojave_homelessness = homelessness[homelessness['state'].isin(canu)]

# See the result
print(mojave_homelessness)

      region       state  individuals  family_members  state_pop
2   Mountain     Arizona       7259.0          2606.0    7158024
4    Pacific  California     109008.0         20964.0   39461588
28  Mountain      Nevada       7058.0           486.0    3027341
44  Mountain        Utah       1904.0           972.0    3153550


## New columns



# *******************************************************************************************************************
**In the last lesson, you saw how to subset and sort a DataFrame to extract interesting bits.  However, often when you first receive a DataFrame, the contents arent exactly what you want.  You may have to add new columns derived from existing columns.  Creating and adding new columns can go by many names, including mutating a DataFrame, transforming a DataFrame, and frature engineering.  

Lets say we want to add a new column to our DataFrame that has each dog's height in meters instead of centimeters.  On the left-hand side of the equal sign, we use square brackets with the name of the new column we want to create.  On the right-hand side, we have the calculation.  Notice that both the existing column and the new column we just created are in the DataFrame.  
Lets see what the results are if we calculate the body mass index (BMI), of those dogs.  BMI = Wieght in kg/(Height in m)^2.  


# *******************************************************************************************************************
# The real power of Pandas comes in when you combine all the skills you've learned so far.  

Lets figure out the names of skinny, tall dogs.  First to define the skinny dogs, we take the subset of the dogs who have BMI under 100.  Next we sort the results in descending order of heightto get the tallest skinny dogs at the top.  Finally we keep only the columns we'vre interested in.  Here you can see that Max is the tallest dog with a BMI of under 100.  


In [156]:
import pandas as pd

df = pd.read_csv('dogs2.csv')
#print(df)


#####################################################################################################################
df['height_m'] = df['height_cn']/100
df['BMI'] = df['weight_km']/df['height_m']**2


print(df)


bmi_under_100 = df[df['BMI']<100]
#####################################################################################################################
print(bmi_under_100.sort_values('height_m', ascending=False)[['name', 'breed', 'color', 'height_m', 'BMI']])

      name       breed  color  height_cn  weight_km data_of_birth  height_m  \
0    Bella    Labrador  Brown         56         24    2013-07-01      0.56   
1  Charlie      Poodle  Black         43         24    2016-09-16      0.43   
2     Lucy   Chow Chow  Brown         46         24    2014-08-25      0.46   
3   Cooper   Schnauzer   Gray         49         17    2011-12-11      0.49   
4      Max    Labrador  Black         59         29    2017-01-20      0.59   
5   Stella   Chihuahua    Tan         18          2    2015-04-20      0.18   
6   Bernie  St.Bernard  White         77         74    2018-02-27      0.77   

          BMI  
0   76.530612  
1  129.799892  
2  113.421550  
3   70.803832  
4   83.309394  
5   61.728395  
6  124.810255  
     name      breed  color  height_m        BMI
4     Max   Labrador  Black      0.59  83.309394
0   Bella   Labrador  Brown      0.56  76.530612
3  Cooper  Schnauzer   Gray      0.49  70.803832
5  Stella  Chihuahua    Tan      0.18  61.7

# DataCamp Object-Oriented Programming in Python

In [10]:
import pandas as pd
from datetime import datetime


class LoggedDF(pd.DataFrame):
    
    def __init__(self, *args, **kwargs):
        pd.DataFrame.__init__(self, *args, **kwargs)
        self.created_at = datetime.today()
    #################################################################################################################
    # We've added .created_at attribute, but what to do if we want to add a .read_csv() method ????   Think further
    #################################################################################################################



ldf = LoggedDF([{'col1': 'data11', 'col2': 'data12', 'col3': 'data13', 'col4': 'data14'}, 
                {'col1': 'data21', 'col2': 'data22', 'col3': 'data23', 'col4': 'data24'}, 
                {'col1': 'data31', 'col2': 'data32', 'col3': 'data33', 'col4': 'data34'}, 
                {'col1': 'data41', 'col2': 'data42', 'col3': 'data43', 'col4': 'data44'}])

print(ldf)

     col1    col2    col3    col4
0  data11  data12  data13  data14
1  data21  data22  data23  data24
2  data31  data32  data33  data34
3  data41  data42  data43  data44


## Adding new columns

You aren't stuck with just the data you are given. Instead, you can add new columns to a DataFrame. 
# This has many names, such as transforming, mutating, and feature engineering.

You can create new columns from scratch, but it is also common to derive them from other columns, for example, by adding columns together or by changing their units.

homelessness is available and pandas is loaded as pd.
Instructions
100 XP

    Add a new column to homelessness, named total, containing the sum of the individuals and family_members columns.
    Add another column to homelessness, named p_individuals, containing the proportion of homeless people in each state who are individuals.


In [158]:
# Add total col as sum of individuals and family_members
homelessness['total'] = homelessness['individuals'] + homelessness['family_members']

# Add p_individuals col as proportion of individuals
homelessness['p_invdividuals'] = homelessness['individuals']/homelessness['total']

# See the result
print(homelessness.head())

             region       state  individuals  family_members  state_pop  \
0  EastSouthCentral     Alabama       2570.0           864.0    4887681   
1           Pacific      Alaska       1434.0           582.0     735139   
2          Mountain     Arizona       7259.0          2606.0    7158024   
3  WestSouthCentral    Arkansas       2280.0           432.0    3009733   
4           Pacific  California     109008.0         20964.0   39461588   

      total  p_invdividuals  
0    3434.0        0.748398  
1    2016.0        0.711310  
2    9865.0        0.735834  
3    2712.0        0.840708  
4  129972.0        0.838704  


## Combo-attack!

# *******************************************************************************************************************
# You've seen the four most common types of data manipulation: sorting rows, subsetting columns, subsetting rows, and adding new columns. In a real-life data analysis, you can mix and match these four manipulations to answer a multitude of questions.

# *******************************************************************************************************************
In this exercise, you'll answer the question, "Which state has the highest number of homeless individuals per 10,000 people in the state?" Combine your new pandas skills to find out.
Instructions
100 XP

    Add a column to homelessness, indiv_per_10k, containing the number of homeless individuals per ten thousand people in each state.
    Subset rows where indiv_per_10k is higher than 20, assigning to high_homelessness.
    Sort high_homelessness by descending indiv_per_10k, assigning to high_homelessness_srt.
    Select only the state and indiv_per_10k columns of high_homelessness_srt and save as result. Look at the result.


In [159]:
# Create indiv_per_10k col as homeless individuals per 10k state pop
homelessness["indiv_per_10k"] = 10000 * homelessness['individuals'] / homelessness['state_pop'] 

# Subset rows for indiv_per_10k greater than 20
high_homelessness = homelessness[homelessness['indiv_per_10k']>20]

# Sort high_homelessness by descending indiv_per_10k
high_homelessness_srt = high_homelessness.sort_values('indiv_per_10k', ascending=False)

# From high_homelessness_srt, select the state and indiv_per_10k cols
result = high_homelessness_srt[['state', 'indiv_per_10k']]

# See the result
print(result)

                 state  indiv_per_10k
8   DistrictofColumbia      53.738381
11              Hawaii      29.079406
4           California      27.623825
37              Oregon      26.636307
28              Nevada      23.314189
47          Washington      21.829195
32             NewYork      20.392363


## Summary statistics



**In the first chapter, you learned about DataFrames, how to sort and subset them, and how to add new columns to them.  In this chapter, we'll talk about aggregating data, starting with summary statistics.  

Summary statistics as follows from their name, are numbers that summarize and tell you about your dataset.  One of the most common summary statistics for numeric data is the .mean(), which is one way of telling you where the center of your data is.  You can calculate the mean of a column by selecting the column with square brackets and calling .mean().  
# *******************************************************************************************************************
There are lots of other summary statistics that you can compute on columns, like .median(), .mode(), .min(), .max(), .var(), and .std().  You can also take .sum() and calculate the .quantile().  You can also get summary statistics for date columns.  For example we can find the oldest dog by taking the .min() on the date of birth column, dogs['date_of_birth'].min()

To aggregate or .agg() method allows you to compute custom summary statistics.  Here we def a function called pct30 that computes the 30th percentile of a DataFrame column.  Now we can subset the weight column and call .agg(pct30) passing the name of our defined function.  The .agg() can also be used on more thann one column, by selecting the weight and height columns before calling agg, we get the answer of both required columns.  We can also use .agg() to get multiple summary statistics at once.  We can actually pass a list of functions into .agg() method. 

# *******************************************************************************************************************
Pandas also has methods for computing cumulative statistics, for example the cumulative sum.  Calling .cumsum() on a column returns not just one number, but a number for each row of the DataFrame.  The first number is returned, or the number in the zeroth index, is the first dog's weight.  The next number is the sum of the first and second dog's weights.  The third number is the sum of the first, second, and the third dog's weights, and soon.  The last number is the sum of all the dog's weights.  
Pandas also has methods for other cumulative statistics, such as the cumulative maximum - .cummax(), cumulative minimum - .cummin(), and the cumulative product - .cumprod().  These methods all returns an entire column of DF.  

# *******************************************************************************************************************
In this chapter, you'll be working with data on Walmart stores, which is a chain of department store in the US.  The dataset contains weekly sales in US dollars in various stores.  Each store has an ID number and a specific store type.  The sales are also separated by department ID.  Along with weekly sales, there is information about whether it was a holiday week or not, the average temperature during the week in that location, the average fuel price in dullars per liter that week, and the national unemployment rate that week.  


def pct30(column):
    return column.quantile(0.3)
    
def pct40(column):
    return column.quantile(0.4)


---------------------------------------------------------------------------------------------------------------
store  | type   | dept    | date        | weekly_sales  | is_holiday   | temp_c    | fuel_price  | unemp
1      | A      | 1       | 2010-02-05  | 24924.50      | False        | 5.73      | 0.679       | 8.106
1      | A      | 2       | 2010-02-05  | 50605.27      | False        | 5.73      | 0.679       | 8.106
1      | A      | 3       | 2010-02-05  | 13740.12      | False        | 5.73      | 0.679       | 8.106
1      | A      | 4       | 2010-02-05  | 39954.04      | False        | 5.73      | 0.679       | 8.106
1      | A      | 5       | 2010-02-05  | 32229.38      | False        | 5.73      | 0.679       | 8.106



In [188]:
print(homelessness['indiv_per_10k'].mean())

print(homelessness['indiv_per_10k'].median())

print(round(homelessness['indiv_per_10k']).mode())

print(homelessness['indiv_per_10k'].min())

print(homelessness['indiv_per_10k'].max())

print(homelessness['indiv_per_10k'].var())

print(homelessness['indiv_per_10k'].std())

print(homelessness['indiv_per_10k'].sum())

print(homelessness['indiv_per_10k'].quantile())
print(homelessness['indiv_per_10k'].quantile(0.25))
print(homelessness['indiv_per_10k'].quantile(0.75))


print()
print(help(pd.Series.quantile))

10.430003339266477
7.122409297196753
0    6.0
dtype: float64
3.435065849943979
53.73838103505538
77.96469732888109
8.829762019946012
531.9301703025903
7.122409297196753
6.04980494709957
9.99472105823659

Help on function quantile in module pandas.core.series:

quantile(self, q=0.5, interpolation='linear')
    Return value at the given quantile.
    
    Parameters
    ----------
    q : float or array-like, default 0.5 (50% quantile)
        The quantile(s) to compute, which can lie in range: 0 <= q <= 1.
    interpolation : {'linear', 'lower', 'higher', 'midpoint', 'nearest'}
        This optional parameter specifies the interpolation method to use,
        when the desired quantile lies between two data points `i` and `j`:
    
            * linear: `i + (j - i) * fraction`, where `fraction` is the
              fractional part of the index surrounded by `i` and `j`.
            * lower: `i`.
            * higher: `j`.
            * nearest: `i` or `j` whichever is nearest.
         

In [189]:
values = '1 3 5 6 9 11 12 13 19 21 22 32 35 36 45 44 55 68 79 80 81 88 90 91 92 100 112 113 \
114 120 121 132 145 146 149 150 155 180 189 190'

vals = values.split(' ')
vals = [int(i) for i in vals]
print(vals)


import numpy as np
np_vals = np.array(vals)

print(np_vals.mean())
#print(np_vals.quantile())   # This is a Pandas Series method, not NumPy Array method

[1, 3, 5, 6, 9, 11, 12, 13, 19, 21, 22, 32, 35, 36, 45, 44, 55, 68, 79, 80, 81, 88, 90, 91, 92, 100, 112, 113, 114, 120, 121, 132, 145, 146, 149, 150, 155, 180, 189, 190]
78.85


In [199]:
import pandas as pd


df = pd.read_csv('dogs2.csv')
print(df)
print()

#print(dir(df['height_cn']))   # agg is there, 


def pct30(column):
    return column.quantile(0.3)

def pct40(column):
    return column.quantile(0.4)


#####################################################################################################################
print(df['height_cn'].agg([pct30, pct40]))

      name       breed  color  height_cn  weight_km data_of_birth
0    Bella    Labrador  Brown         56         24    2013-07-01
1  Charlie      Poodle  Black         43         24    2016-09-16
2     Lucy   Chow Chow  Brown         46         24    2014-08-25
3   Cooper   Schnauzer   Gray         49         17    2011-12-11
4      Max    Labrador  Black         59         29    2017-01-20
5   Stella   Chihuahua    Tan         18          2    2015-04-20
6   Bernie  St.Bernard  White         77         74    2018-02-27

pct30    45.4
pct40    47.2
Name: height_cn, dtype: float64


## Mean and median

Summary statistics are exactly what they sound like - they summarize many numbers in one statistic. For example, mean, median, minimum, maximum, and standard deviation are summary statistics. Calculating summary statistics allows you to get a better sense of your data, even if there's a lot of it.

sales is available and pandas is loaded as pd.
Instructions
70 XP

    Explore your new DataFrame first by printing the first few rows of the sales DataFrame.
    Print information about the columns in sales.
    Print the mean of the weekly_sales column.
    Print the median of the weekly_sales column.


In [1]:
import pandas as pd


sales = pd.read_csv('sales_subset.csv', index_col=0)
#print(sales.head())


# Print the head of the sales DataFrame
print(sales.head())
print()

# Print the info about the sales DataFrame
print(sales.info())
print()

# Print the mean of weekly_sales
print(sales['weekly_sales'].mean())

# Print the median of weekly_sales
print(sales['weekly_sales'].median())

   store type  department        date  weekly_sales  is_holiday  \
0      1    A           1  2010-02-05      24924.50       False   
1      1    A           1  2010-03-05      21827.90       False   
2      1    A           1  2010-04-02      57258.43       False   
3      1    A           1  2010-05-07      17413.94       False   
4      1    A           1  2010-06-04      17558.09       False   

   temperature_c  fuel_price_usd_per_l  unemployment  
0       5.727778              0.679451         8.106  
1       8.055556              0.693452         8.106  
2      16.816667              0.718284         7.808  
3      22.527778              0.748928         7.808  
4      27.050000              0.714586         7.808  

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10774 entries, 0 to 10773
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   store                 10774 non-null  int64  
 1   

## Summarizing dates

Summary statistics can also be calculated on date columns that have values with the data type datetime64. Some summary statistics — like mean — don't make a ton of sense on dates, but others are super helpful, for example, minimum and maximum, which allow you to see what time range your data covers.

sales is available and pandas is loaded as pd.
Instructions
100 XP

    Print the maximum of the date column.
    Print the minimum of the date column.


In [2]:
# Print the maximum of the date column
print(sales['date'].max())

# Print the minimum of the date column
print(sales['date'].min())

2012-10-26
2010-02-05


## Efficient summaries

While pandas and NumPy have tons of functions, sometimes, you may need a different function to summarize your data.

The .agg() method allows you to apply your own custom functions to a DataFrame, as well as apply functions to more than one column of a DataFrame at once, making your aggregations super-efficient. For example,

df['column'].agg(function)

In the custom function for this exercise, "IQR" is short for inter-quartile range, which is the 75th percentile minus the 25th percentile. It's an alternative to standard deviation that is helpful if your data contains outliers.

sales is available and pandas is loaded as pd.
Instructions 1/3
35 XP

    Question 1
    Use the custom iqr function defined for you along with .agg() to print the IQR of the temperature_c column of sales.
    
    
    
    Question 2
    Update the column selection to use the custom iqr function with .agg() to print the IQR of temperature_c, fuel_price_usd_per_l, and unemployment, in that order.
    
    
    
    Question 3
    Update the aggregation functions called by .agg(): include iqr and np.median in that order.


In [3]:
# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)
    
# Print IQR of the temperature_c column
print(sales['temperature_c'].agg(iqr))

16.583333333333336


In [4]:
# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)


# Print IQR of the temperature_c column
print(sales[['temperature_c', 'fuel_price_usd_per_l', 'unemployment']].agg(iqr))

temperature_c           16.583333
fuel_price_usd_per_l     0.073176
unemployment             0.565000
dtype: float64


In [6]:
import numpy as np


# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)


#####################################################################################################################
# Print IQR of the temperature_c column
print(sales[['temperature_c', 'fuel_price_usd_per_l', 'unemployment']].agg([iqr, np.median]))

        temperature_c  fuel_price_usd_per_l  unemployment
iqr         16.583333              0.073176         0.565
median      16.966667              0.743381         8.099


## Cumulative statistics

# *******************************************************************************************************************
Cumulative statistics can also be helpful in tracking summary statistics over time. In this exercise, you'll calculate the cumulative sum and cumulative max of a department's weekly sales, which will allow you to identify what the total sales were so far as well as what the highest weekly sales were so far.

A DataFrame called sales_1_1 has been created for you, which contains the sales data for department 1 of store 1. pandas is loaded as pd.
Instructions
100 XP

    Sort the rows of sales_1_1 by the date column in ascending order.
    Get the cumulative sum of weekly_sales and add it as a new column of sales_1_1 called cum_weekly_sales.
    Get the cumulative maximum of weekly_sales, and add it as a column called cum_max_sales.
    Print the date, weekly_sales, cum_weekly_sales, and cum_max_sales columns.


In [10]:
#####################################################################################################################
#####################################################################################################################
sales_1_1 = sales[(sales['store']==1) & (sales['department']==1)]

print(sales_1_1)
print()


# Sort sales_1_1 by date
sales_1_1 = sales_1_1.sort_values('date', ascending=True)

# Get the cumulative sum of weekly_sales, add as cum_weekly_sales col
sales_1_1['cum_weekly_sales'] = sales_1_1['weekly_sales'].cumsum()

# Get the cumulative max of weekly_sales, add as cum_max_sales col
sales_1_1['cum_max_sales'] = sales_1_1['weekly_sales'].cummax()

# See the columns you calculated
print(sales_1_1[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]])

    store type  department        date  weekly_sales  is_holiday  \
0       1    A           1  2010-02-05      24924.50       False   
1       1    A           1  2010-03-05      21827.90       False   
2       1    A           1  2010-04-02      57258.43       False   
3       1    A           1  2010-05-07      17413.94       False   
4       1    A           1  2010-06-04      17558.09       False   
5       1    A           1  2010-07-02      16333.14       False   
6       1    A           1  2010-08-06      17508.41       False   
7       1    A           1  2010-09-03      16241.78       False   
8       1    A           1  2010-10-01      20094.19       False   
9       1    A           1  2010-11-05      34238.88       False   
10      1    A           1  2010-12-03      22517.56       False   
11      1    A           1  2011-01-07      15984.24       False   

    temperature_c  fuel_price_usd_per_l  unemployment  
0        5.727778              0.679451         8.106  
1  

## Counting



# *******************************************************************************************************************
So far in this chapter, you've learned how to summarize numeric variables.  In this video you'll learn how to summarize categorical data using counting.  Counting dogs is no easy task when they're running around the park, its hard to keep track of who you have and havent counted.  

# *******************************************************************************************************************
# Here is a DataFrame that contains vet visits, the vet's office want to know how many dogs of each breed have visited their office.  However some dogs have been to the vet more than once, like Max and Stella, so we can't count the number of each breed in the breed column.  

**Lets try to fix this by removing rows that contain a dog name already listed earlier in the dataset, or in other words, we'll extract a dog with each name from the dataset once.  We can do this using .drop_duplicates() method, it takes an argument - subset=, which is the column we want to find our duplicates based on.  Now we have a list of dogs where each one apperas once.  

# *******************************************************************************************************************
**We have Max the Chow Chow, but where did Max the Labrador go?  Because we have two different dogs with the same name, we'll need to consider more than just name when droping duplicates.  Since Max the Chow Chow and Max the Labrador are different breed, we can drop the rows with pairs of name and breed we listed earlier in the dataset.  To base our duplicate dropping on multiple columns, we can pass a list of column names to the subset= argument, now both Maxes have been included and we can start counting.  


**To count the dogs of each breed, we'll subset the breed column and use the .value_counts() method.  We can also use the sort= argument to get the breeds with biggest counts on top.  Also the normalize= argument can be used to turn the counts into proportions of the total.


print(vet_visits)
-----------------------------------------------------
date          | name      | breed       | weight_kg
2018-09-02    | Bella     | Labrador    | 24.87
2019-06-07    | Max       | Labrador    | 28.35
2018-01-17    | Stella    | Chihuahua   | 1.51
2019-10-19    | Lucy      | Chow Chow   | 24.07
......
2018-01-20    | Stella    | Chihuahua   | 2.83
2019-06-07    | Max       | Chow Chow   | 24.01
2018-08-20    | Lucy      | Chow Chow   | 24.40
2019-04-22    | Max       | Labrador    | 28.54


vet_visits.drop_duplicates(subset='name')
-----------------------------------------------------
date          | name      | breed       | weight_kg
2018-09-02    | Bella     | Labrador    | 24.87
2019-06-07    | Max       | Chow Chow   | 24.01
2019-03-19    | Charlie   | Poodle      | 24.95
2018-01-17    | Stella    | Chihuahua   | 1.51
2019-10-19    | Lucy      | Chow Chow   | 24.07
2019-03-30    | Cooper    | Schnauzer   | 16.91
2019-01-04    | Bernie    | St. Bernard | 74.98


vet_visits.drop_duplicates(['name', 'breed'])
-----------------------------------------------------
date          | name      | breed       | weight_kg
2018-09-02    | Bella     | Labrador    | 24.87
2019-03-13    | Max       | Chow Chow   | 24.13
2019-03-19    | Charlie   | Poodle      | 24.95
2018-01-17    | Stella    | Chihuahua   | 1.51
2019-10-19    | Lucy      | Chow Chow   | 24.07
2019-06-07    | Max       | Chow Chow   | 24.01
2019-03-30    | Cooper    | Schnauzer   | 16.91
2019-01-04    | Bernie    | St. Bernard | 74.98


unique_dogs['breed'].value_counts(sort=True)
----------------------------------------------------
Labrador      2
Chow Chow     2
Schnauzer     1
St. Bernard   1
Poodle        1
Chihuahua     1
Name: breed, dtype: int64


unique_dogs['breed'].value_counts(sort=True, normalize=True)
----------------------------------------------------
Labrador      0.250
Chow Chow     0.250
Schnauzer     0.125
St. Bernard   0.125
Poodle        0.125
Chihuahua     0.125
Name: breed, dtype: int64

## Dropping duplicates

Removing duplicates is an essential skill to get accurate counts because often, you don't want to count the same thing multiple times. In this exercise, you'll create some new DataFrames using unique values from sales.

sales is available and pandas is imported as pd.
Instructions
100 XP

    Remove rows of sales with duplicate pairs of store and type and save as store_types and print the head.
    Remove rows of sales with duplicate pairs of store and department and save as store_depts and print the head.
    Subset the rows that are holiday weeks using the is_holiday column, and drop the duplicate dates, saving as holiday_dates.
    Select the date column of holiday_dates, and print.


In [3]:
import pandas as pd


sales = pd.read_csv('sales_subset.csv', index_col=0)


# Drop duplicate store/type combinations
store_types = sales.drop_duplicates(subset=['store', 'type'])
print(store_types)

# Drop duplicate store/department combinations
store_depts = sales.drop_duplicates(subset=['store', 'department'])
print(store_depts)
print()

# Subset the rows where is_holiday is True and drop duplicate dates
holiday_dates = sales[sales['is_holiday']==True].drop_duplicates(subset='date')
#####################################################################################################################

print(type(sales['is_holiday'][0]))
print()

# Print date col of holiday_dates
print(holiday_dates.head())

      store type  department        date  weekly_sales  is_holiday  \
0         1    A           1  2010-02-05      24924.50       False   
901       2    A           1  2010-02-05      35034.06       False   
1798      4    A           1  2010-02-05      38724.42       False   
2699      6    A           1  2010-02-05      25619.00       False   
3593     10    B           1  2010-02-05      40212.84       False   
4495     13    A           1  2010-02-05      46761.90       False   
5408     14    A           1  2010-02-05      32842.31       False   
6293     19    A           1  2010-02-05      21500.58       False   
7199     20    A           1  2010-02-05      46021.21       False   
8109     27    A           1  2010-02-05      32313.79       False   
9009     31    A           1  2010-02-05      18187.71       False   
9899     39    A           1  2010-02-05      21244.50       False   

      temperature_c  fuel_price_usd_per_l  unemployment  
0          5.727778            

## Counting categorical variables

Counting is a great way to get an overview of your data and to spot curiosities that you might not notice otherwise. In this exercise, you'll count the number of each type of store and the number of each department number using the DataFrames you created in the previous exercise:

# Drop duplicate store/type combinations
store_types = sales.drop_duplicates(subset=["store", "type"])

# Drop duplicate store/department combinations
store_depts = sales.drop_duplicates(subset=["store", "department"])

The store_types and store_depts DataFrames you created in the last exercise are available, and pandas is imported as pd.
Instructions
100 XP

# *******************************************************************************************************************
#    Count the number of stores of each store type in store_types.
    Count the proportion of stores of each store type in store_types.
#    Count the number of different departments in store_depts, sorting the counts in descending order.
    Count the proportion of different departments in store_depts, sorting the proportions in descending order.


In [67]:
import pandas as pd


sales = pd.read_csv('sales_subset.csv', index_col=0)


store_types = sales.drop_duplicates(subset=['store', 'type'])
#####################################################################################################################
store_depts = sales.drop_duplicates(subset=['store', 'department'])


store_counts = store_types['type'].value_counts()
print(store_counts)

store_props = store_types['type'].value_counts(normalize=True)
print(store_props)

dept_counts_sorted = store_depts['department'].value_counts(sort=True)
print(dept_counts_sorted)

dept_props_sorted = store_depts['department'].value_counts(sort=True, normalize=True)
print(dept_props_sorted)

A    11
B     1
Name: type, dtype: int64
A    0.916667
B    0.083333
Name: type, dtype: float64
1     12
55    12
72    12
71    12
67    12
      ..
37    10
48     8
50     6
39     4
43     2
Name: department, Length: 80, dtype: int64
1     0.012917
55    0.012917
72    0.012917
71    0.012917
67    0.012917
        ...   
37    0.010764
48    0.008611
50    0.006459
39    0.004306
43    0.002153
Name: department, Length: 80, dtype: float64


## Grouped summary statistics



# *******************************************************************************************************************
**So far, you've been calculating summary statistics fro all rows of a dataset, but summary statistics can be useful to compare different groups.  While computing summary statistics of entire columns may be useful, you can gain many insights from summaries of individual groups.  

# *******************************************************************************************************************
**For example, does one color of dog weight more than another on average?  Are female dogs taller than males?  You can already answer these questions with what you've learned so far.  

**We can subset the dogs into groups based on their color, and take the mean of each.  But thats a lot of work, and the duplicated code means you can easily introduce copy and paste bugs.  

# *******************************************************************************************************************
**Thats where the .groupby() method comes in.  

We can group by the color variable, select the weight column and take the mean.  This will give us the mean weight for each dog color.  This was just one line of code compared to the five we had to write before to get the same results.  Just like with ungrouped summary statistics, we can use the .agg() method to get multiple statistics.  
**You can also group by multiple columns and calculate summary statistics.  Here we group by color and breed, select the weight column and take the mean.  
**You can also group by multiple columns and aggregate by multiple columns.  


dogs[dogs['color']=='black']['weight_kg'].mean()
dogs[dogs['color']=='brown']['weight_kg'].mean()
dogs[dogs['color']=='Tan']['weight_kg'].mean()
.........


dogs.groupby('color')['weight_kg'].mean()


dogs.groupby('color')['weight_kg'].agg([np.mean, np.median, pd.Series.quantile(0.7)])


dogs.groupby(['color', 'breed'])['weight_kg'].mean()


dogs.groupby(['color', 'breed'])[['weight_kg', 'height_cm']].mean()



In [68]:
import pandas as pd
import numpy as np


dogs = pd.read_csv('dogs2.csv')
print(dogs.head())


def quantile75(column):
    return column.quantile(0.75)

                                                                 ## pd.Series.quantile(0.75) approach failed
dogs.groupby('color')['weight_km'].agg([np.mean, np.median, np.var, quantile75])
#dogs.groupby('color')['weight_km'].agg([np.mean, np.median, np.var, np.quantile(self, 0.75)])   ### Is a NP function
#####################################################################################################################
# The purpoise of quantiles?  So we can imagine the destribution curve??


#print(dir(pd.Series))   # quantile is included

#print(help(pd.Series.quantile))

#print(help(np.quantile))

      name      breed  color  height_cn  weight_km data_of_birth
0    Bella   Labrador  Brown         56         24    2013-07-01
1  Charlie     Poodle  Black         43         24    2016-09-16
2     Lucy  Chow Chow  Brown         46         24    2014-08-25
3   Cooper  Schnauzer   Gray         49         17    2011-12-11
4      Max   Labrador  Black         59         29    2017-01-20


Unnamed: 0_level_0,mean,median,var,quantile75
color,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Black,26.5,26.5,12.5,27.75
Brown,24.0,24.0,0.0,24.0
Gray,17.0,17.0,,17.0
Tan,2.0,2.0,,2.0
White,74.0,74.0,,74.0


In [71]:
import numpy as np


our_data = np.array([1, 2, 3, 4, 5, 6, 7])

#print(our_data.quantile(0.75))  # AttributeError: 'numpy.ndarray' object has no attribute 'quantile'
#####################################################################################################################
print(np.quantile(our_data, 0.75))

5.5


## What percent of sales occurred at each store type?

While .groupby() is useful, you can calculate grouped summary statistics without it.

Walmart distinguishes three types of stores: "supercenters," "discount stores," and "neighborhood markets," encoded in this dataset as type "A," "B," and "C." In this exercise, you'll calculate the total sales made at each store type, without using .groupby(). You can then use these numbers to see what proportion of Walmart's total sales were made at each type.

sales is available and pandas is imported as pd.
Instructions
100 XP

    Calculate the total weekly_sales over the whole dataset.
    Subset for type "A" stores, and calculate their total weekly sales.
    Do the same for type "B" and type "C" stores.
    Combine the A/B/C results into a list, and divide by sales_all to get the proportion of sales by type.


In [53]:
import pandas as pd


sales = pd.read_csv('sales_subset.csv')
#print(sales.head())


# Calc total weekly sales
sales_all = sales['weekly_sales'].sum()
print(sales_all)

# Subset for type A stores, calc total weekly sales
sales_A = sales[sales['type']=='A']['weekly_sales'].sum()

# Subset for type B stores, calc total weekly sales
sales_B = sales[sales['type']=='B']['weekly_sales'].sum()

# Subset for type C stores, calc total weekly sales
sales_C = sales[sales['type']=='C']['weekly_sales'].sum()

# Get proportion for each type
sales_propn_by_type = [sales_A, sales_B, sales_C]/sales_all
#####################################################################################################################
# I thought you put sales_A, sales_B sales_C into a list, it becomes a list object, broadcasting not working anymore

question = [sales_A, sales_B, sales_C]
print(type(question))  # <class 'list'>
#####################################################################################################################
print(type(sales_propn_by_type))    # So NumPy Array infected the list into working broadcasting?  Very Interesting


print(sales_propn_by_type)

256894718.89999998
<class 'list'>
<class 'numpy.ndarray'>
[0.9097747 0.0902253 0.       ]


## Calculations with .groupby()

The .groupby() method makes life much easier. In this exercise, you'll perform the same calculations as last time, except you'll use the .groupby() method. You'll also perform calculations on data grouped by two variables to see if sales differ by store type depending on if it's a holiday week or not.

sales is available and pandas is loaded as pd.
Instructions 1/2
50 XP

    Question 1
    Group sales by "type", take the sum of "weekly_sales", and store as sales_by_type.
    Calculate the proportion of sales at each store type by dividing by the sum of sales_by_type. Assign to sales_propn_by_type.
    
    
    
    Question 2
#    Group sales by "type" and "is_holiday", take the sum of weekly_sales, and store as sales_by_type_is_holiday


In [59]:
import pandas as pd


sales = pd.read_csv('sales_subset.csv')
#print(sales.head())

# Group by type; calc total weekly sales
sales_by_type = sales.groupby('type')['weekly_sales'].sum()


# Get proportion for each type
sales_propn_by_type = sales_by_type/sum(sales_by_type)
print(sales_propn_by_type)



#####################################################################################################################
#####################################################################################################################
sales_by_type_is_holiday = sales.groupby(['type', 'is_holiday'])['weekly_sales'].sum()
print(sales_by_type_is_holiday)


# Think why we want this?  wha's the purpose of it?  
#####################################################################################################################
# So based on the prospective of Walmart, 1) weekly_sales means a lot for each store, 2) is_holiday had the
# huge influence on weekly_sales (like unemployment_rate and fuel_price, those influence people's mind on shoping)
# And we also want to know the performance of each type, 'supercenters', 'discount stores', 'neighborhood markets',

type
A    0.909775
B    0.090225
Name: weekly_sales, dtype: float64
type  is_holiday
A     False         2.336927e+08
      True          2.360181e+04
B     False         2.317678e+07
      True          1.621410e+03
Name: weekly_sales, dtype: float64


## Multiple grouped summaries

Earlier in this chapter, you saw that the .agg() method is useful to compute multiple statistics on multiple variables. It also works with grouped data. NumPy, which is imported as np, has many different summary statistics functions, including: np.min, np.max, np.mean, and np.median.

sales is available and pandas is imported as pd.
Instructions
100 XP

    Import numpy with the alias np.
    Get the min, max, mean, and median of weekly_sales for each store type using .groupby() and .agg(). Store this as sales_stats. Make sure to use numpy functions!
#    Get the min, max, mean, and median of unemployment and fuel_price_usd_per_l for each store type. Store this as unemp_fuel_stats.


In [66]:
import numpy as np
import pandas as pd


sales = pd.read_csv('sales_subset.csv')
#print(sales.head())


sales_stats = sales.groupby('type')['weekly_sales'].agg([np.min, np.max, np.mean, np.median])
print(sales_stats)


#####################################################################################################################
unemp_fuel_stats=sales.groupby('type')[['unemployment','fuel_price_usd_per_l']].agg([np.min,np.max,np.mean,np.median])
print(unemp_fuel_stats)

        amin       amax          mean    median
type                                           
A    -1098.0  293966.05  23674.667242  11943.92
B     -798.0  232558.51  25696.678370  13336.08
     unemployment                         fuel_price_usd_per_l            \
             amin   amax      mean median                 amin      amax   
type                                                                       
A           3.879  8.992  7.972611  8.067             0.664129  1.107410   
B           7.170  9.765  9.279323  9.199             0.760023  1.107674   

                          
          mean    median  
type                      
A     0.744619  0.735455  
B     0.805858  0.803348  


## Pivot tables



# *******************************************************************************************************************
**Pivot tables are another way of calculating grouped summary statistics.  Lets see how to create pivot tables in Pandas.  In the last lesson, we grouped the dogs by color and calculated their mean weights.  

# We can do the same thing using the .pivot_table() method.  
# *******************************************************************************************************************
The values= argument is the column that you want to summarize, and the 
index= argument is the column that you want to groupby.  
By default, .pivot_table() takes the mean value for each group.  If we want a different summary statistics, we can use the aggfunc= argument and pass it a function.  Here we take the median for each dog color using NumPy's .median() function.  

To get multiple summary statistics at a time, we can pass a list of functions into the aggfunc= argument, belwo we get the median and max weight_kg for each colors group.  


**We also previously computed the mean weight_kg grouped by two variables, dog color and dog breed.  We can also do this using the .pivot_table() method.  

# *******************************************************************************************************************
To group by two variables, we can pass a second variable name into the columns= argument.  While the result looks a little different than what we had before, it contains the same numbers.  There are NaNs, or missing values, because there are no black Chihuahuas or gray Labradors in our dataset, for example.  

Instead of having lots of missing values in our pivot table, we can have them filled in using the fill_value=0 argument.  If we set the margins= argument to True, the last row and last column of the pivot_table contain the mean of all the values in the column or row, not including the missing values that were filled with 0s.  The value in the bottom right, in the last row and last column, is the mean weight_kg of all the dogs in our dataset.  

Using margins=True allows us to see a summary statistics for multiple levels of dataset: the entire dataset, the grouped by one variable, by another variable, or by two variables. 



dogs.groupby('color')['weight_kg'].median()

dogs.pivot_table(values='weight_kg', index='colors')

dogs.pivot_table(values='weight_kg', index='colors', aggfunc=np.median)

dogs.pivot_table(values='weight_kg', index='colors', aggfunc=[np.median, np.max])


dog.groupby(['color', 'breed'])['weight_kg'].mean()

dog.pivot_table(values=['weight_kg', 'height_cm'], index='color', columns='breed', aggfunc=[np.median, np.max])

dog.pivot_table(values=['weight_kg', 'height_cm'], index='color', columns='breed', fill_value=0, aggfunc=[np.median, np.max])

dog.pivot_table(values=['weight_kg', 'height_cm'], index='color', columns='breed', fill_value=0, margins=True)



dog.pivot_table(values=['weight_kg', 'height_cm'], index=['color', 'breed'], aggfunc=[np.median, np.max])
## Do Not Use This One, cause two level of indexing makes it very hard to read on the scream


In [86]:
import pandas as pd


dogs = pd.read_csv('dogs2.csv')
print(dogs.head())


# One line of code Yes, elegant No
#####################################################################################################################
print(dogs.pivot_table(values=['weight_km', 'height_cn'], index=['color', 'breed'], aggfunc=[np.median, np.max]))
                                                          ####### DO Not Take This Approach
                                                          ####### Cause its hard to read infos


#####################################################################################################################
print(dogs.pivot_table(values=['weight_km','height_cn'], index='color', columns='breed', aggfunc=[np.median,np.max]))


#####################################################################################################################
print(dogs.pivot_table(values=['weight_km','height_cn'], index='color', columns='breed', fill_value=0))


#####################################################################################################################
print(dogs.pivot_table(values=['weight_km','height_cn'], index='color', columns='breed', fill_value=0, margins=True))

      name      breed  color  height_cn  weight_km data_of_birth
0    Bella   Labrador  Brown         56         24    2013-07-01
1  Charlie     Poodle  Black         43         24    2016-09-16
2     Lucy  Chow Chow  Brown         46         24    2014-08-25
3   Cooper  Schnauzer   Gray         49         17    2011-12-11
4      Max   Labrador  Black         59         29    2017-01-20
                    median                amax          
                 height_cn weight_km height_cn weight_km
color breed                                             
Black Labrador          59        29        59        29
      Poodle            43        24        43        24
Brown Chow Chow         46        24        46        24
      Labrador          56        24        56        24
Gray  Schnauzer         49        17        49        17
Tan   Chihuahua         18         2        18         2
White St.Bernard        77        74        77        74
         median                         

## Pivoting on one variable

# *******************************************************************************************************************
# *******************************************************************************************************************
# Pivot tables are the standard way of aggregating data in spreadsheets. In pandas, pivot tables are essentially just another way of performing grouped calculations. That is, the .pivot_table() method is just an alternative to .groupby().

In this exercise, you'll perform calculations using .pivot_table() to replicate the calculations you performed in the last lesson using .groupby().

sales is available and pandas is imported as pd.
Instructions 1/3
35 XP

    Question 1
    Get the mean weekly_sales by type using .pivot_table() and store as mean_sales_by_type.
    
    
    
    Question 2
#    Get the mean and median (using NumPy functions) of weekly_sales by type using .pivot_table() and store as mean_med_sales_by_type.
    
    
    
    Question 3
#    Get the mean of weekly_sales by type and is_holiday using .pivot_table() and store as mean_sales_by_type_holiday.


In [95]:
import pandas as pd


sales = pd.read_csv('sales_subset.csv')
print(sales.head())

# Pivot for mean weekly_sales for each store type
mean_sales_by_type = sales.pivot_table(values='weekly_sales', index='type')
                                                                         # .mean() Do Not Add else mean everything
#####################################################################################################################


# Print mean_sales_by_type
print(mean_sales_by_type)



import numpy as np

# Pivot for mean and median weekly_sales for each store type
mean_med_sales_by_type = sales.pivot_table(values='weekly_sales', index='type', aggfunc=[np.mean, np.median])

print(mean_med_sales_by_type )




mean_sales_by_type_holiday = sales.pivot_table(values='weekly_sales', index='type', columns='is_holiday')
print(mean_sales_by_type_holiday)

   Unnamed: 0  store type  department        date  weekly_sales  is_holiday  \
0           0      1    A           1  2010-02-05      24924.50       False   
1           1      1    A           1  2010-03-05      21827.90       False   
2           2      1    A           1  2010-04-02      57258.43       False   
3           3      1    A           1  2010-05-07      17413.94       False   
4           4      1    A           1  2010-06-04      17558.09       False   

   temperature_c  fuel_price_usd_per_l  unemployment  
0       5.727778              0.679451         8.106  
1       8.055556              0.693452         8.106  
2      16.816667              0.718284         7.808  
3      22.527778              0.748928         7.808  
4      27.050000              0.714586         7.808  
      weekly_sales
type              
A     23674.667242
B     25696.678370
              mean       median
      weekly_sales weekly_sales
type                           
A     23674.667242     

## Fill in missing values and sum values with pivot tables

# *******************************************************************************************************************
# *******************************************************************************************************************
# The .pivot_table() method has several useful arguments, including fill_value and margins.

    fill_value replaces missing values with a real value (known as imputation). What to replace missing values with is a topic big enough to have its own course (Dealing with Missing Data in Python), but the simplest thing to do is to substitute a dummy value.
#    margins is a shortcut for when you pivoted by two variables, but also wanted to pivot by each of those variables separately: it gives the row and column totals of the pivot table contents.

In this exercise, you'll practice using these arguments to up your pivot table skills, which will help you crunch numbers more efficiently!

sales is available and pandas is imported as pd.
Instructions 1/2
50 XP

    Question 1
    Print the mean weekly_sales by department and type, filling in any missing values with 0.
    
    
    
    Question 2
    Print the mean weekly_sales by department and type, filling in any missing values with 0 and summing all rows and columns.


In [97]:
# Print mean weekly_sales by department and type; fill missing values with 0
print(sales.pivot_table(values='weekly_sales', index='department', columns='type', fill_value=0))




# Print mean weekly_sales by department and type; fill missing values with 0, and summing all rows and columns
print(sales.pivot_table(values='weekly_sales', index='department', columns='type', fill_value=0, margins=True))

type                    A              B
department                              
1            30961.725379   44050.626667
2            67600.158788  112958.526667
3            17160.002955   30580.655000
4            44285.399091   51219.654167
5            34821.011364   63236.875000
...                   ...            ...
95          123933.787121   77082.102500
96           21367.042857    9528.538333
97           28471.266970    5828.873333
98           12875.423182     217.428333
99             379.123659       0.000000

[80 rows x 2 columns]
type                   A              B           All
department                                           
1           30961.725379   44050.626667  32052.467153
2           67600.158788  112958.526667  71380.022778
3           17160.002955   30580.655000  18278.390625
4           44285.399091   51219.654167  44863.253681
5           34821.011364   63236.875000  37189.000000
...                  ...            ...           ...
96          

## Explicit indexes



# *******************************************************************************************************************
In chapter one, we saw that DataFrames are composed of 3 parts: a NumPy array for the data, and two indexes to store the row and column details.  

Here is the dogs dataset again, racall that dogs.columns attribute contains an index object of column names, and dogs.index attribute contains an index object of row numbers.  

**You can move a column from the body of the DataFrame to be index.  This is called setting an index, and it uses the .set_index() method.  Notice that the output has changed slightly, in particular, a quick visual clue that name is now in the index is that the index values are left-aligned rather than right-aligned.  

**To undo what we just did, we can apply the .reset_index() method to the DataFrame.  The reset_index() has a drop= argument that allows you to discard an index.  Here setting drop=True removes the dog names column.  



You may be wondering why you should bother with indexes, the answer is that it makes subsetting code cleaner.  
# *******************************************************************************************************************
# Consider this example of subsetting for the rows where the dogs is called Bella or Stella.  Its a fairly tricky line of code for such a simple task.  Now look at the equivalent when the name column is the index.  

DataFrame have a subsetting method called .loc[], which filters on the index values.  Much easier. 


dogs[dogs['name'].isin(['Bella', 'Stella'])]
---------------------------------------------------------------------------------------------------------------------

new_dogs.loc['Bella', 'Stella']]
---------------------------------------------------------------------------------------------------------------------



# *******************************************************************************************************************
The values in the index don't need to be unique, you can set breed as index with two Labrador in it.  Then if we subset on Labrador using .loc[] method, all the Labrador data is retuined.  

You can include multiple columns in the index by passing a list of column names to .set_index() method.  These are called multi-level indexes, or hierarchical indexes.  

# *******************************************************************************************************************
There is an implication here that the inner level of index, in this case: color, is nested inside the outer level: breed.  To take a subset of rows at the outer level index, you pass a list of index values to .loc[] method.  Here the list contains Labrador and Chihuahua, and the resulting subset contains all dogs from both breeds.  

To subset on inner level index, you need to pass a list of tuples.  Here the first tuple specifies Labrador at the outer level of index, and Brown at the inner level.  The resulting rows have to match all conditions from a tuple.  For example the black Labrador wasn't returned because the Brown condition wasn't matched.  



new1_dogs = dogs.set_index(['breed', 'color'])


new1_dogs.loc(['Labrador', 'Chihuahua'])



new1_dogs.loc[('Labrador', 'Brown')]

new1_dogs.loc[[('Labrador', 'Brown'), ('Chihuahua', 'Tan')]]






# *******************************************************************************************************************

**In chapter 1, you saw how to sort the rows of a DataFrame using .sort_values() method.  You can also sort by index values using .sort_index() method.  By default, it sorts all index levels from outer to inner, in ascending order.  You can control the sorting by passing lists to the level= and ascending= arguments.  


# *******************************************************************************************************************
Indexes are controversal.  Although they simplify subsetting sode, there are also some downsides.  Index values are just data, and storing data in multiple forms makes it harder to think aboyt.  

There is a concept called tidy data, where data is stored in tabular form - like a DataFrame.  Each row contains a single observation, and each variable is stored in its own column.  Indexes violate the last rule since index values don't get they own column.  

In Pandas, the syntax for working with indexes is different from the syntax for working with columns.  By using two syntaxes, your code is more complicated, which can result in more bugs.  

If you decide you don't want to use indexes, that's perfectly reasonable.  However, its useful to know how they work for cases when you need to read other people's code.  In this chapter, you'll work with a monthly time series of air temperatures in cities around the world.  



new1_dogs.sort_index()


new1_dogs.sort_index(level=['color', 'breed'], ascending=[True, False])




--------------------------------------------------------------
  date        | city       | country        | avg_temp_c
  2000-01-01  | Abidjan    | Cote D'lvoire  | 27.293
  2000-02-01  | Abidjan    | Cote D'lvoire  | 27.685
  2000-03-01  | Abidjan    | Cote D'lvoire  | 29.061
  2000-04-01  | Abidjan    | Cote D'lvoire  | 28.162
  2000-05-01  | Abidjan    | Cote D'lvoire  | 27.547
  
  

In [9]:
import pandas as pd


dogs = pd.read_csv('dogs2.csv')

new_dogs = dogs.set_index('name')
print(new_dogs.head())
print(new_dogs.index)
#print(help(pd.DataFrame.set_index))
#print(help(dogs.set_index))
#####################################################################################################################
# I want to remove the first row after the header, the empty value row


nn_dogs = new_dogs.reset_index(drop=True)
print(nn_dogs.head())
print()

print(dogs[dogs['name'].isin(['Bella', 'Stella'])])
print(new_dogs.loc[['Bella', 'Stella']])



print()
#####################################################################################################################
co_breed = dogs.set_index(['color', 'breed'])
print(co_breed)

print()
print(co_breed.loc['Brown', 'Labrador'])
#####################################################################################################################
print(co_breed.loc[[('Brown', 'Labrador'), ('Tan', 'Chihuahua')]])    ## Double square brackets
print(co_breed.loc[[('Brown', 'Labrador'), ('Black', 'Labrador')]])




print(co_breed.sort_index(level=['color', 'breed'], ascending=[True, False]))
#####################################################################################################################
# Did you niticed? When put two indexes level nested here, it becomes harder for human eyes to read information

             breed  color  height_cn  weight_km data_of_birth
name                                                         
Bella     Labrador  Brown         56         24    2013-07-01
Charlie     Poodle  Black         43         24    2016-09-16
Lucy     Chow Chow  Brown         46         24    2014-08-25
Cooper   Schnauzer   Gray         49         17    2011-12-11
Max       Labrador  Black         59         29    2017-01-20
Index(['Bella', 'Charlie', 'Lucy', 'Cooper', 'Max', 'Stella', 'Bernie'], dtype='object', name='name')
       breed  color  height_cn  weight_km data_of_birth
0   Labrador  Brown         56         24    2013-07-01
1     Poodle  Black         43         24    2016-09-16
2  Chow Chow  Brown         46         24    2014-08-25
3  Schnauzer   Gray         49         17    2011-12-11
4   Labrador  Black         59         29    2017-01-20

     name      breed  color  height_cn  weight_km data_of_birth
0   Bella   Labrador  Brown         56         24    2013-07-01

## Setting and removing indexes

pandas allows you to designate columns as an index. This enables cleaner code when taking subsets (as well as providing more efficient lookup under some circumstances).

In this chapter, you'll be exploring temperatures, a DataFrame of average temperatures in cities around the world. pandas is loaded as pd.
Instructions
100 XP

    Look at temperatures.
    Set the index of temperatures to "city", assigning to temperatures_ind.
    Look at temperatures_ind. How is it different from temperatures?
    Reset the index of temperatures_ind, keeping its contents.
    Reset the index of temperatures_ind, dropping its contents.


In [15]:
import pandas as pd


temperatures = pd.read_csv('temperatures.csv', index_col=0)
#print(temperatures.head())


# Look at temperatures
print(temperatures.head())

# Index temperatures by city
temperatures_ind = temperatures.set_index('city')

# Look at temperatures_ind
print(temperatures_ind.head())

# Reset the index, keeping its contents
print(temperatures_ind.reset_index().head())

# Reset the index, dropping its contents
print(temperatures_ind.reset_index(drop=True).head())

         date     city        country  avg_temp_c
0  2000-01-01  Abidjan  Côte D'Ivoire      27.293
1  2000-02-01  Abidjan  Côte D'Ivoire      27.685
2  2000-03-01  Abidjan  Côte D'Ivoire      29.061
3  2000-04-01  Abidjan  Côte D'Ivoire      28.162
4  2000-05-01  Abidjan  Côte D'Ivoire      27.547
               date        country  avg_temp_c
city                                          
Abidjan  2000-01-01  Côte D'Ivoire      27.293
Abidjan  2000-02-01  Côte D'Ivoire      27.685
Abidjan  2000-03-01  Côte D'Ivoire      29.061
Abidjan  2000-04-01  Côte D'Ivoire      28.162
Abidjan  2000-05-01  Côte D'Ivoire      27.547
      city        date        country  avg_temp_c
0  Abidjan  2000-01-01  Côte D'Ivoire      27.293
1  Abidjan  2000-02-01  Côte D'Ivoire      27.685
2  Abidjan  2000-03-01  Côte D'Ivoire      29.061
3  Abidjan  2000-04-01  Côte D'Ivoire      28.162
4  Abidjan  2000-05-01  Côte D'Ivoire      27.547
         date        country  avg_temp_c
0  2000-01-01  Côte D'Ivoire  

## Subsetting with .loc[]

# *******************************************************************************************************************
# The killer feature for indexes is .loc[]: a subsetting method that accepts index values. When you pass it a single argument, it will take a subset of rows.

The code for subsetting using .loc[] can be easier to read than standard square bracket subsetting, which can make your code less burdensome to maintain.

pandas is loaded as pd. temperatures and temperatures_ind are available; the latter is indexed by city.
Instructions
100 XP

    Create a list called cities that contains "Moscow" and "Saint Petersburg".
    Use [] subsetting to filter temperatures for rows where the city column takes a value in the cities list.
    Use .loc[] subsetting to filter temperatures_ind for rows where the city is in the cities list.


In [20]:
import pandas as pd


temperatures = pd.read_csv('temperatures.csv', index_col=0)

temperatures_ind = temperatures.set_index('city')

# Make a list of cities to subset on
cities = ["Moscow", "Saint Petersburg"]

# Subset temperatures using square brackets
print(temperatures[temperatures['city'].isin(cities)])

# Subset temperatures_ind using .loc[]
#print(temperatures.loc[cities])
#####################################################################################################################
print(temperatures_ind.loc[cities])

             date              city country  avg_temp_c
10725  2000-01-01            Moscow  Russia      -7.313
10726  2000-02-01            Moscow  Russia      -3.551
10727  2000-03-01            Moscow  Russia      -1.661
10728  2000-04-01            Moscow  Russia      10.096
10729  2000-05-01            Moscow  Russia      10.357
...           ...               ...     ...         ...
13360  2013-05-01  Saint Petersburg  Russia      12.355
13361  2013-06-01  Saint Petersburg  Russia      17.185
13362  2013-07-01  Saint Petersburg  Russia      17.234
13363  2013-08-01  Saint Petersburg  Russia      17.153
13364  2013-09-01  Saint Petersburg  Russia         NaN

[330 rows x 4 columns]
                        date country  avg_temp_c
city                                            
Moscow            2000-01-01  Russia      -7.313
Moscow            2000-02-01  Russia      -3.551
Moscow            2000-03-01  Russia      -1.661
Moscow            2000-04-01  Russia      10.096
Moscow    

## Setting multi-level indexes

# Indexes can also be made out of multiple columns, forming a multi-level index (sometimes called a hierarchical index). There is a trade-off to using these.

# *******************************************************************************************************************
The benefit is that multi-level indexes make it more natural to reason about nested categorical variables. For example, in a clinical trial, you might have control and treatment groups. Then each test subject belongs to one or another group, and we can say that a test subject is nested inside the treatment group. Similarly, in the temperature dataset, the city is located in the country, so we can say a city is nested inside the country.

# *******************************************************************************************************************
The main downside is that the code for manipulating indexes is different from the code for manipulating columns, so you have to learn two syntaxes and keep track of how your data is represented.

pandas is loaded as pd. temperatures is available.
Instructions
100 XP

    Set the index of temperatures to the "country" and "city" columns, and assign this to temperatures_ind.
    Specify two country/city pairs to keep: "Brazil"/"Rio De Janeiro" and "Pakistan"/"Lahore", assigning to rows_to_keep.
    Print and subset temperatures_ind for rows_to_keep using .loc[].


In [23]:
import pandas as pd


temperatures = pd.read_csv('temperatures.csv', index_col=0)


# Index temperatures by country & city
temperatures_ind = temperatures.set_index(['country', 'city'])
print(temperatures_ind.head())

# List of tuples: Brazil, Rio De Janeiro & Pakistan, Lahore
rows_to_keep = [('Brazil', 'Rio De Janeiro'), ('Pakistan', 'Lahore')]
#####################################################################################################################
print(temperatures_ind.loc[rows_to_keep])

                             date  avg_temp_c
country       city                           
Côte D'Ivoire Abidjan  2000-01-01      27.293
              Abidjan  2000-02-01      27.685
              Abidjan  2000-03-01      29.061
              Abidjan  2000-04-01      28.162
              Abidjan  2000-05-01      27.547
                               date  avg_temp_c
country  city                                  
Brazil   Rio De Janeiro  2000-01-01      25.974
         Rio De Janeiro  2000-02-01      26.699
         Rio De Janeiro  2000-03-01      26.270
         Rio De Janeiro  2000-04-01      25.750
         Rio De Janeiro  2000-05-01      24.356
...                             ...         ...
Pakistan Lahore          2013-05-01      33.457
         Lahore          2013-06-01      34.456
         Lahore          2013-07-01      33.279
         Lahore          2013-08-01      31.511
         Lahore          2013-09-01         NaN

[330 rows x 2 columns]


## Sorting by index values

Previously, you changed the order of the rows in a DataFrame by calling .sort_values(). It's also useful to be able to sort by elements in the index. For this, you need to use .sort_index().

pandas is loaded as pd. temperatures_ind has a multi-level index of country and city, and is available.
Instructions
100 XP

    Sort temperatures_ind by the index values.
    Sort temperatures_ind by the index values at the "city" level.
    Sort temperatures_ind by ascending country then descending city.


In [29]:
import pandas as pd


temperatures = pd.read_csv('temperatures.csv')
print(temperatures.head())



temperature_ind = temperatures.set_index(['country', 'city'])
print(temperatures_ind.head())


# Sort temperatures_ind by index values
print(temperatures_ind.sort_index())

# Sort temperatures_ind by index values at the city level
print(temperatures_ind.sort_index(level='city'))

# Sort temperatures_ind by country then descending city
#####################################################################################################################
print(temperatures_ind.sort_index(level=['country', 'city'], ascending=[True, False]))


   Unnamed: 0        date     city        country  avg_temp_c
0           0  2000-01-01  Abidjan  Côte D'Ivoire      27.293
1           1  2000-02-01  Abidjan  Côte D'Ivoire      27.685
2           2  2000-03-01  Abidjan  Côte D'Ivoire      29.061
3           3  2000-04-01  Abidjan  Côte D'Ivoire      28.162
4           4  2000-05-01  Abidjan  Côte D'Ivoire      27.547
                             date  avg_temp_c
country       city                           
Côte D'Ivoire Abidjan  2000-01-01      27.293
              Abidjan  2000-02-01      27.685
              Abidjan  2000-03-01      29.061
              Abidjan  2000-04-01      28.162
              Abidjan  2000-05-01      27.547
                          date  avg_temp_c
country     city                          
Afghanistan Kabul   2000-01-01       3.326
            Kabul   2000-02-01       3.454
            Kabul   2000-03-01       9.612
            Kabul   2000-04-01      17.925
            Kabul   2000-05-01      24.658
...  

## Slicing and subsetting with .loc and .iloc




**Slicing is a technique for selecting consecutive elements from objects.  Below is a dog breed list, to slice a list, you pass first and last positioon indexes separated by a colon into a pair of square brackets breeds[2:4] - meaning selecting 3th to 4th elements in list.  Doing slicing from the begining of the list, we can use 0 or just leave it empty - breeds[:4] or breeds[0:4] - meaning selecting first 4 elements.  also breeds[4:] or breeds[:] meaning selecting 5th till last element, and selecting all elements.  


# *******************************************************************************************************************
# We can also slicing DataFrames, but we need to sort the index first.  

Here the dogs dataset has been given a multi-level index of breed and color, then the index is sorted with .sort_index() method.  To slice rows at the outer level of an index, you call .loc[] and passing first and last values separated by a colon.  

# *******************************************************************************************************************
There are two differences compared to slicing lists.  Rather that specifying row nembers, you specifying index values.  Secondly, notice that the final value is included.  

# *******************************************************************************************************************
The same technique doen't work on inner index levels, Here trying to slice from Tan to Grey using dogs_sort.loc['Tan': 'Gray'] returns an empty DataFrame instead of 6 dogs info we wanted.  Thus its important to understand the danger here, Pandas doesn't throw an error to let you know that there is a problem, so be careful when doing so.  
* The correst approach to slicing an inner index levels is to pass the first and last positions as tuples - dogs_sort.loc[('Chihuahua', 'Tan'): ('Schnauzer', 'Gray')].  



# *******************************************************************************************************************
# Since DataFrames are two-dimensional objects, you can also slice columns.  We do this by passing two arguments to .loc[] method.  The simplest case involves subsetting columns but keeping all rows.  like here using dogs_sorted.loc[:, 'name': 'weight_kg'].  As with slicing lists, a colon by itself means keep everything.  

We can also slice on rows and columns at the same time, simply pass the appropriate slice to each argument.  


# *******************************************************************************************************************
# An important use case of slicing is to subset DataFrame by a range of dates.  
To demostrate this, lets set the date_of_birth column as the index and applying .sort_index().  
* You slice dates with the same syntax as other types.  
One helpful feature is that you can slice by partical dates.  Like this dogs_dates.loc['2015': '2017'].  The Pandas interprets this as slicing from the start of 2015 to the end of 2017.  


# *******************************************************************************************************************
# You can also slice DataFrames by row or column number using the .iloc[] method.  This uses similar syntax to slicing lists, except that there are two arguments: one for rows and one for columns.  

Notice that, like list slicing but unlike .loc[], the final values aren't included in the slice.  



breeds = ['Labrador', 'Poodle', 'Chow Chow', 'Schnauzer', 'Labrador', 'Chihuahua', 'St. Bernard']

breeds[1:3]
# Remember 0 indexing, and last value in square bracket are not included



dogs_sort = dogs.set_index(['breed', 'color']).sort_index


dogs_dates = dogs.set_index('date_of_birth').sort_index()

dogs_dates.loc['2015': '2017']


In [50]:
import pandas as pd



dogs = pd.read_csv('dogs2.csv')

dogs_sorted = dogs.set_index(['breed', 'color']).sort_index()
print(dogs_sorted)


#####################################################################################################################
print(dogs_sorted.loc['Chihuahua': 'Poodle'])


#####################################################################################################################
print(dogs_sorted.loc[('Chihuahua', 'Tan'): ('Schnauzer', 'Gray')])


#####################################################################################################################
print(dogs_sorted.loc[:, 'name':'weight_km'])


# Just not elegant, not Pythonic at all
#####################################################################################################################
print(dogs_sorted.loc[('Chihuahua', 'Tan'): ('Schnauzer', 'Gray'), 'name': 'weight_km'])





dogs_dates = dogs.set_index('data_of_birth').sort_index()
print(dogs_dates)



#####################################################################################################################
date2013_2017 = dogs_dates.loc['2013-01-01': '2017-12-30']

print(date2013_2017)

                     name  height_cn  weight_km data_of_birth
breed      color                                             
Chihuahua  Tan     Stella         18          2    2015-04-20
Chow Chow  Brown     Lucy         46         24    2014-08-25
Labrador   Black      Max         59         29    2017-01-20
           Brown    Bella         56         24    2013-07-01
Poodle     Black  Charlie         43         24    2016-09-16
Schnauzer  Gray    Cooper         49         17    2011-12-11
St.Bernard White   Bernie         77         74    2018-02-27
                    name  height_cn  weight_km data_of_birth
breed     color                                             
Chihuahua Tan     Stella         18          2    2015-04-20
Chow Chow Brown     Lucy         46         24    2014-08-25
Labrador  Black      Max         59         29    2017-01-20
          Brown    Bella         56         24    2013-07-01
Poodle    Black  Charlie         43         24    2016-09-16
               

## Slicing index values

Slicing lets you select consecutive elements of an object using first:last syntax. DataFrames can be sliced by index values or by row/column number; we'll start with the first case. This involves slicing inside the .loc[] method.

# Compared to slicing lists, there are a few things to remember.

    You can only slice an index if the index is sorted (using .sort_index()).
    To slice at the outer level, first and last can be strings.
    To slice at inner levels, first and last should be tuples.
    If you pass a single slice to .loc[], it will slice the rows.

pandas is loaded as pd. temperatures_ind has country and city in the index, and is available.
Instructions
100 XP

    Sort the index of temperatures_ind.
    Use slicing with .loc[] to get these subsets:
        from Pakistan to Russia.
        from Lahore to Moscow. (This will return nonsense.)
        from Pakistan, Lahore to Russia, Moscow.


In [55]:
import pandas as pd


temperatures = pd.read_csv('temperatures.csv')
print(temperatures.head())



temperatures_ind = temperatures.set_index(['country', 'city']).sort_index()
print(temperatures_ind.head())



# Subset rows from Pakistan to Russia
print(temperatures_ind.loc['Pakistan': 'Russia'])

# Try to subset rows from Lahore to Moscow
print(temperatures_ind.loc[('Pakistan', 'Lahore'): ('Russia', 'Moscow')])

# Subset rows from Pakistan, Lahore to Russia, Moscow
#print(____)

   Unnamed: 0        date     city        country  avg_temp_c
0           0  2000-01-01  Abidjan  Côte D'Ivoire      27.293
1           1  2000-02-01  Abidjan  Côte D'Ivoire      27.685
2           2  2000-03-01  Abidjan  Côte D'Ivoire      29.061
3           3  2000-04-01  Abidjan  Côte D'Ivoire      28.162
4           4  2000-05-01  Abidjan  Côte D'Ivoire      27.547
                   Unnamed: 0        date  avg_temp_c
country     city                                     
Afghanistan Kabul        7260  2000-01-01       3.326
            Kabul        7261  2000-02-01       3.454
            Kabul        7262  2000-03-01       9.612
            Kabul        7263  2000-04-01      17.925
            Kabul        7264  2000-05-01      24.658
                           Unnamed: 0        date  avg_temp_c
country  city                                                
Pakistan Faisalabad              4785  2000-01-01      12.792
         Faisalabad              4786  2000-02-01      14.339
  

## Slicing in both directions

# You've seen slicing DataFrames by rows and by columns, but since DataFrames are two-dimensional objects, it is often natural to slice both dimensions at once. That is, by passing two arguments to .loc[], you can subset by rows and columns in one go.

pandas is loaded as pd. temperatures_srt is indexed by country and city, has a sorted index, and is available.
Instructions
100 XP

    Use .loc[] slicing to subset rows from India, Hyderabad to Iraq, Baghdad.
    Use .loc[] slicing to subset columns from date to avg_temp_c.
    Slice in both directions at once from Hyderabad to Baghdad, and date to avg_temp_c.


In [59]:
import pandas as pd


temperatures = pd.read_csv('temperatures.csv')

temperatures_srt = temperatures.set_index(['country', 'city']).sort_index()
print(temperatures_srt.head())



# slicing to subset rows from India, Hyderabad to Iraq, Baghdad
print(temperatures_srt.loc[('India', 'Hyderabad'): ('Iraq', 'Baghdad')])


# slicing to subset columns from date to avg_temp_c
print(temperatures_srt.loc[:, 'date': 'avg_temp_c'])



print(temperatures_srt.loc[('India', 'Hyderabad'): ('Iraq', 'Baghdad'), 'date': 'avg_temp_c'])


                   Unnamed: 0        date  avg_temp_c
country     city                                     
Afghanistan Kabul        7260  2000-01-01       3.326
            Kabul        7261  2000-02-01       3.454
            Kabul        7262  2000-03-01       9.612
            Kabul        7263  2000-04-01      17.925
            Kabul        7264  2000-05-01      24.658
                   Unnamed: 0        date  avg_temp_c
country city                                         
India   Hyderabad        5940  2000-01-01      23.779
        Hyderabad        5941  2000-02-01      25.826
        Hyderabad        5942  2000-03-01      28.821
        Hyderabad        5943  2000-04-01      32.698
        Hyderabad        5944  2000-05-01      32.438
...                       ...         ...         ...
Iraq    Baghdad          1150  2013-05-01      28.673
        Baghdad          1151  2013-06-01      33.803
        Baghdad          1152  2013-07-01      36.392
        Baghdad          115

## Slicing time series

# *******************************************************************************************************************
Slicing is particularly useful for time series since it's a common thing to want to filter for data within a date range. Add the date column to the index, then use .loc[] to perform the subsetting. The important thing to remember is to keep your dates in ISO 8601 format, that is, "yyyy-mm-dd" for year-month-day, "yyyy-mm" for year-month, and "yyyy" for year.

# Recall from Chapter 1 that you can combine multiple Boolean conditions using logical operators, such as &. To do so in one line of code, you'll need to add parentheses () around each condition.

pandas is loaded as pd and temperatures, with no index, is available.
Instructions
100 XP

    Use Boolean conditions, not .isin() or .loc[], and the full date "yyyy-mm-dd", to subset temperatures for rows in 2010 and 2011 and print the results.
    Set the index to the date column and sort it.
    Use .loc[] to subset temperatures_ind for rows in 2010 and 2011.
    Use .loc[] to subset temperatures_ind for rows from Aug 2010 to Feb 2011.


In [60]:
import pandas as pd


temperatures = pd.read_csv('temperatures.csv')
print(temperatures.head())

temperatures_bool = 


   Unnamed: 0        date     city        country  avg_temp_c
0           0  2000-01-01  Abidjan  Côte D'Ivoire      27.293
1           1  2000-02-01  Abidjan  Côte D'Ivoire      27.685
2           2  2000-03-01  Abidjan  Côte D'Ivoire      29.061
3           3  2000-04-01  Abidjan  Côte D'Ivoire      28.162
4           4  2000-05-01  Abidjan  Côte D'Ivoire      27.547
