# 1. DataFrames

What is the Point of Pandas?

    1. Data Manipulation
    2. Data Visualization 
    
Pandas is built on NumPy and Matplotlib

    NumPy: provides multidimensional array objects for easy data manipulation; pandas uses them to store data
    Matplotlib: data visualization capabilities that pandas takes advantage of 
    
    Exploring a DataFrame: 
        .head() - return first few rows 
        .info() - names of columns, the data types they contain, and whether they have any missing values.
        .shape - attribute (no parentheses) holding a tuple: the number of rows followed by the number of columns (rows x columns)
        .describe() - summary statistics for numeric columns
        
    Components of a DataFrame: 
        .values - contains the data values in a 2-dimensional NumPy array.
        .columns - contains column names
        .index - contains row numbers or row names. Be careful: row labels are stored in .index, not in .rows.
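
The inspection methods and attributes listed above can be tried on a small table. This is a minimal sketch: the `dogs` data below is made up for illustration and is not one of the course datasets.

```python
import pandas as pd

# Hypothetical toy data, standing in for a real dataset
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Charlie", "Lucy"],
        "breed": ["Labrador", "Poodle", "Chow Chow"],
        "height_cm": [56, 43, 46],
    }
)

print(dogs.head())      # first rows (up to five by default)
dogs.info()             # column names, dtypes, non-null counts (prints directly)
print(dogs.shape)       # (rows, columns)
print(dogs.describe())  # summary statistics for numeric columns

n_rows, n_cols = dogs.shape
column_names = list(dogs.columns)
row_labels = list(dogs.index)
values = dogs.values    # 2-dimensional NumPy array
```

Note that `.info()` prints its report itself and returns `None`, so wrapping it in `print()` also prints a trailing `None`.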

### 1.1 Inspecting a DataFrame

In [3]:
'''# Print the head of the homelessness data
print(homelessness.head())

# Print information about homelessness
print(homelessness.info())

# Print the shape of homelessness
print(homelessness.shape)

# Print a description of homelessness
print(homelessness.describe())'''


### 1.2 Parts of a DataFrame

In [4]:
'''# Import pandas using the alias pd
import pandas as pd 

# Print the values of homelessness
print(homelessness.values)

# Print the column index of homelessness
print(homelessness.columns)


# Print the row index of homelessness
print(homelessness.index)
'''


### 1.3 Sorting and Subsetting

Sorting: dogs.sort_values("weight_kg")

Sorting in descending order: dogs.sort_values("weight_kg", ascending = False)

Sorting by multiple variables: dogs.sort_values(["weight_kg", "height_cm"], ascending = [True,False])

Subsetting columns: dogs["name"]

Subsetting multiple columns: dogs[["breed", "height_cm"]]

    The inner and outer square brackets perform different tasks: the outer brackets subset the DataFrame, and the inner brackets create a list of column names to subset.

Subsetting rows: dogs[dogs["height_cm"] > 50]

Subsetting based on text data: dogs[dogs["breed"] == "Labrador"]

Subsetting based on dates: dogs[dogs["date_of_birth"] < "2015-01-01"]

Subsetting based on multiple conditions: 

    is_lab = dogs["breed"] == "Labrador"
    is_brown = dogs["color"] == "Brown"
    dogs[is_lab & is_brown]

    
Subsetting using .isin(): 


    is_black_or_brown = dogs["color"].isin(["Black", "Brown"])
    dogs[is_black_or_brown] 
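
The sorting and subsetting patterns above can be run end to end on a small table. The sketch below assumes a made-up `dogs` DataFrame; the names and numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy data
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
        "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer", "Labrador"],
        "color": ["Brown", "Black", "Brown", "Grey", "Black"],
        "height_cm": [56, 43, 46, 49, 59],
        "weight_kg": [25, 23, 22, 17, 29],
    }
)

# Sort by weight, heaviest first
heaviest_first = dogs.sort_values("weight_kg", ascending=False)

# Each comparison (==, not =) yields a boolean Series; & combines them row by row
is_lab = dogs["breed"] == "Labrador"
is_brown = dogs["color"] == "Brown"
brown_labs = dogs[is_lab & is_brown]

# .isin() keeps rows whose value matches any entry of the list
black_or_brown = dogs[dogs["color"].isin(["Black", "Brown"])]

# Double brackets: the inner list names the columns, the outer brackets subset
breed_height = dogs[["breed", "height_cm"]]
```

When the conditions are written inline rather than stored in variables, each one needs its own parentheses, as in `dogs[(dogs["height_cm"] > 50) & (dogs["color"] == "Brown")]`.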

#### 1.3.1 Sorting Rows

In [4]:
'''# Sort homelessness by individuals
homelessness_ind = homelessness.sort_values("individuals")

# Print the top few rows
print(homelessness_ind.head())''' 


In [2]:
'''# Sort homelessness by descending family members
homelessness_fam = homelessness.sort_values("family_members", ascending = False)

# Print the top few rows
print(homelessness_fam.head())'''


In [6]:
'''# Sort homelessness by region, then descending family members
homelessness_reg_fam = homelessness.sort_values(["region","family_members"] , ascending = [True, False])

# Print the top few rows
print(homelessness_reg_fam.head())'''


#### 1.3.2 Subsetting Columns


In [8]:
'''# Select the individuals column
individuals = homelessness["individuals"]

# Print the head of the result
print(individuals.head())'''


In [10]:
'''# Select the state and family_members columns
state_fam = homelessness[["state", 'family_members']]

# Print the head of the result
print(state_fam.head())'''


In [11]:
'''# Select only the individuals and state columns, in that order
ind_state = homelessness[["individuals", "state"]]

# Print the head of the result
print(ind_state.head())'''


#### 1.3.3 Subsetting Rows

    dogs[dogs["height_cm"] > 60]
    dogs[dogs["color"] == "tan"]
    
    dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]

In [14]:
'''# Filter for rows where individuals is greater than 10000
ind_gt_10k = homelessness[homelessness["individuals"] > 10000]

# See the result
print(ind_gt_10k)'''

In [16]:
'''# Filter for rows where region is Mountain
mountain_reg = homelessness[homelessness["region"] == "Mountain"]

# See the result
print(mountain_reg)'''


In [17]:
'''# Filter for rows where family_members is less than 1000 
# and region is Pacific
fam_lt_1k_pac = homelessness[(homelessness["family_members"] < 1000) & (homelessness["region"] == "Pacific")]

# See the result
print(fam_lt_1k_pac)'''


#### 1.3.4 Subsetting rows by categorical variables

    colors = ["brown", "black", "tan"]
    condition = dogs["color"].isin(colors)
    dogs[condition]

In [19]:
'''
# Subset for rows in South Atlantic or Mid-Atlantic regions
south_mid_atlantic = homelessness[homelessness["region"].isin(["South Atlantic","Mid-Atlantic"])]

# See the result
print(south_mid_atlantic)
'''

In [20]:
'''
# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]

# Filter for rows in the Mojave Desert states
mojave_homelessness = homelessness[homelessness["state"].isin(canu)]

# See the result
print(mojave_homelessness)
'''


### 1.4 New columns

    You can add new columns to a DataFrame. This has many names, such as transforming, mutating, and feature engineering.

    You can create new columns from scratch, but it is also common to derive them from other columns, for example, by adding 
    columns together or by changing their units.
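
Both ways of deriving a column (changing units and combining columns) look like this in practice. This is a small sketch on a made-up `dogs` table; the column names `height_m` and `bmi` are illustrative choices.

```python
import pandas as pd

# Hypothetical toy data
dogs = pd.DataFrame(
    {"name": ["Bella", "Charlie"], "height_cm": [50, 40], "weight_kg": [25, 16]}
)

# Change units: centimetres to metres
dogs["height_m"] = dogs["height_cm"] / 100

# Derive a column from two others (a body-mass-index-style ratio)
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2
```

Assigning to a column name that does not exist yet appends the column on the right; assigning to an existing name overwrites it.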



#### 1.4.1 Adding new columns

In [22]:
'''# Add total col as sum of individuals and family_members
homelessness["total"] = homelessness["individuals"] + homelessness["family_members"]

# Add p_individuals col as proportion of total that are individuals
homelessness["p_individuals"] = homelessness["individuals"] / homelessness["total"] 

# See the result
print(homelessness)'''



####  1.4.2 Combo-attack!

You've seen the four most common types of data manipulation: sorting rows, subsetting columns, subsetting rows, and adding new columns. In a real-life data analysis, you can mix and match these four manipulations to answer a multitude of questions.

In this exercise, you'll answer the question, "Which state has the highest number of homeless individuals per 10,000 people in the state?" Combine your new pandas skills to find out.

In [27]:
'''

# Create indiv_per_10k col as homeless individuals per 10k state pop
homelessness["indiv_per_10k"] = 10000 * homelessness["individuals"] / homelessness["state_pop"]

# Subset rows for indiv_per_10k greater than 20
high_homelessness = homelessness[homelessness["indiv_per_10k"] > 20]

# Sort high_homelessness by descending indiv_per_10k
high_homelessness_srt = high_homelessness.sort_values("indiv_per_10k", ascending = False)

# From high_homelessness_srt, select the state and indiv_per_10k cols
result = high_homelessness_srt[["state", "indiv_per_10k"]]

# See the result
print(result)
print(homelessness)

'''


Cool combination! District of Columbia has the highest number of homeless individuals - almost 54 per ten thousand people. This is almost double the number of the next-highest state, Hawaii. If you combine new column addition, row subsetting, sorting, and column selection, you can answer lots of questions like this.

# 2. Aggregating DataFrames

### 2.1 Summary Statistics

    Summarizing numerical data: dogs["height_cm"].mean() 
      .median(), .mode(), .min(), .max(), .var(), .std()
      .sum(), .cumsum()
      
    The agg method 
    
    def pct30(column):
        return column.quantile(0.3)
        
    call function - dogs["weight_kg"].agg(pct30)
    
    Summaries on multiple columns
        dogs[["weight_kg", "height_cm"]].agg(pct30)
    
      
    def pct40(column):
        return column.quantile(0.4) 
    
    dogs[["weight_kg", "height_cm"]].agg([pct30,pct40])
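
The `.agg()` patterns above return different shapes depending on what is passed in. The sketch below uses made-up numbers chosen so the percentiles are easy to check by hand.

```python
import pandas as pd

# Hypothetical toy data
dogs = pd.DataFrame(
    {"weight_kg": [10, 20, 30, 40, 50], "height_cm": [30, 35, 40, 45, 50]}
)

def pct30(column):
    """30th percentile of a column."""
    return column.quantile(0.3)

def pct40(column):
    """40th percentile of a column."""
    return column.quantile(0.4)

# One column, one function: a single number
one_col = dogs["weight_kg"].agg(pct30)

# Several columns, one function: a Series with one value per column
many_cols = dogs[["weight_kg", "height_cm"]].agg(pct30)

# Several columns, several functions: a DataFrame (functions as rows)
many_funcs = dogs[["weight_kg", "height_cm"]].agg([pct30, pct40])
```

With five evenly spaced values, the 30th percentile of `weight_kg` falls between the second and third values, at about 22.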


#### 2.1.1 Mean and Median


In [1]:
'''# Print the head of the sales DataFrame
print(sales.head())

# Print the info about the sales DataFrame
print(sales.info())

# Print the mean of weekly_sales
print(sales["weekly_sales"].mean())

# Print the median of weekly_sales
print(sales["weekly_sales"].median())'''


The mean weekly sales amount is almost double the median weekly sales amount! This can tell you that there are a few very high sales weeks that are making the mean so much higher than the median.
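
The effect of a few very large values on the mean can be seen in a tiny example. The numbers below are invented for illustration and are not taken from the `sales` dataset.

```python
import pandas as pd

# Hypothetical weekly sales: mostly modest weeks plus one very large one
weekly_sales = pd.Series([100, 120, 110, 130, 2000])

mean_sales = weekly_sales.mean()      # pulled upward by the large week
median_sales = weekly_sales.median()  # middle value, barely affected by it
```

Here the mean is 492 while the median is 120, so a large gap between the two is a quick hint that the distribution is skewed.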

#### 2.1.2 Summarizing dates

Summary statistics can also be calculated on date columns that have values with the data type datetime64. Some summary statistics — like mean — don't make a ton of sense on dates, but others are super helpful, for example, minimum and maximum, which allow you to see what time range your data covers.

In [2]:
'''# Print the maximum of the date column
print(sales["date"].max())

# Print the minimum of the date column
print(sales["date"].min())
'''


Taking the minimum and maximum of a column of dates is handy for figuring out what time period your data covers. 
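
As a minimal sketch of the same idea, assuming a made-up `date` column, the minimum and maximum can also be subtracted to get the length of the covered period:

```python
import pandas as pd

# Hypothetical dates, converted to datetime64 values
sales_toy = pd.DataFrame(
    {"date": pd.to_datetime(["2010-02-05", "2012-10-26", "2011-06-17"])}
)

first_day = sales_toy["date"].min()
last_day = sales_toy["date"].max()
span = last_day - first_day  # a Timedelta
```

If a date column was read in as plain strings, `pd.to_datetime()` converts it first.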

#### 2.1.3 Efficient summaries

The .agg() method allows you to apply your own custom functions to a DataFrame, as well as apply functions to more than one column of a DataFrame at once, making your aggregations super-efficient. For example,

    # A custom IQR function
    def iqr(column):
        return column.quantile(0.75) - column.quantile(0.25)

In [3]:
'''# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)
    
# Print IQR of the temperature_c column
print(sales["temperature_c"].agg(iqr))'''


In [4]:
'''# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Update to print IQR of temperature_c, fuel_price_usd_per_l, & unemployment
print(sales[["temperature_c", 'fuel_price_usd_per_l', 'unemployment']].agg(iqr))'''


In [5]:
'''

# Import NumPy and create custom IQR function
import numpy as np
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Update to print IQR and median of temperature_c, fuel_price_usd_per_l, & unemployment
print(sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg([iqr,np.median]))

'''


 The .agg() method makes it easy to compute multiple statistics on multiple columns, all in just one line of code.

#### 2.1.4 Cumulative Statistics

Cumulative statistics can also be helpful in tracking summary statistics over time. In this exercise, you'll calculate the cumulative sum and cumulative max of a department's weekly sales, which will allow you to identify what the total sales were so far as well as what the highest weekly sales were so far.
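
A small sketch of the two cumulative statistics, using an invented four-week table (`sales_toy` is a hypothetical stand-in for a single department's sales):

```python
import pandas as pd

# Hypothetical toy data, deliberately out of date order
sales_toy = pd.DataFrame(
    {
        "date": pd.to_datetime(["2010-02-19", "2010-02-05", "2010-02-12", "2010-02-26"]),
        "weekly_sales": [40, 25, 46, 19],
    }
)

# Sort by date first so that "so far" follows calendar order
sales_toy = sales_toy.sort_values("date")

# Running total and running maximum of weekly sales
sales_toy["cum_weekly_sales"] = sales_toy["weekly_sales"].cumsum()
sales_toy["cum_max_sales"] = sales_toy["weekly_sales"].cummax()
```

Sorting matters: cumulative methods work down the rows in their current order, so unsorted dates would give a running total that jumps around in time.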

In [7]:
'''
# Sort sales_1_1 by date
sales_1_1 = sales_1_1.sort_values("date")

# Get the cumulative sum of weekly_sales, add as cum_weekly_sales col
sales_1_1["cum_weekly_sales"] = sales_1_1["weekly_sales"].cumsum()

# Get the cumulative max of weekly_sales, add as cum_max_sales col
sales_1_1["cum_max_sales"] = sales_1_1["weekly_sales"].cummax()

# See the columns you calculated
print(sales_1_1[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]])

'''


### 2.2 Counting 

    Drop duplicates
   
        vet_visits.drop_duplicates(subset = "name") 

    Drop Duplicate pairs
    
        vet_visits.drop_duplicates(subset = ["name", "breed"]) 
        
        
    Count instances 
    
           unique_dogs["breed"].value_counts()
           unique_dogs["breed"].value_counts(sort = True)
           
    Proportions
           unique_dogs["breed"].value_counts(normalize = True)
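
The counting steps above fit together as follows. The `vet_visits` table here is made up: one dog appears twice, and two different dogs share a name.

```python
import pandas as pd

# Hypothetical toy data
vet_visits = pd.DataFrame(
    {
        "name": ["Bella", "Bella", "Max", "Max", "Lucy"],
        "breed": ["Labrador", "Labrador", "Labrador", "Chow Chow", "Poodle"],
    }
)

# Keep the first row for each name/breed pair
unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"])

# Counts per breed, most frequent first
breed_counts = unique_dogs["breed"].value_counts()

# Same counts expressed as proportions of the total
breed_props = unique_dogs["breed"].value_counts(normalize=True)
```

Dropping duplicates on `name` alone would merge the two dogs called Max, which is why the pair of columns is used. `sort=True` is already the default for `.value_counts()`, so passing it explicitly only documents the intent.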

#### 2.2.1 Dropping duplicates

Removing duplicates is essential for getting accurate counts, because you usually don't want to count the same thing more than once.

In [10]:
'''

# Drop duplicate store/type combinations
store_types = sales.drop_duplicates(subset = ["store", "type"])
print(store_types.head())

# Drop duplicate store/department combinations
store_depts =sales.drop_duplicates(subset = ["store", "department"])
print(store_depts.head())

# Subset the rows where is_holiday is True and drop duplicate dates
holiday_dates = sales[sales["is_holiday"]==True].drop_duplicates(subset = "date")

# Print date col of holiday_dates
print(holiday_dates["date"])

'''


#### 2.2.2 Counting categorical variables

Counting is a great way to get an overview of your data and to spot curiosities that you might not notice otherwise.

In [13]:
'''

# Count the number of stores of each type
store_counts = store_types["type"].value_counts()
print(store_counts)

# Get the proportion of stores of each type
store_props = store_types["type"].value_counts(normalize=True)
print(store_props)

# Count the number of each department number and sort
dept_counts_sorted = store_depts["department"].value_counts(sort = True)
print(dept_counts_sorted)

# Get the proportion of departments of each number and sort
dept_props_sorted = store_depts["department"].value_counts(sort=True, normalize=True)
print(dept_props_sorted)

'''


### 2.3 Grouped summary statistics

    Complex: Summaries by group
    
        dogs[dogs["color"] == "Black"]["weight_kg"].mean()
        
    Grouped summaries
    
        dogs.groupby("color")["weight_kg"].mean()
        
    Multiple grouped summaries
    
        dogs.groupby("color")["weight_kg"].agg([min,max,sum])
        
    Grouping by multiple variables
        dogs.groupby(["color", "breed"])["weight_kg"].mean()

    Many groups, many summaries    
        dogs.groupby(["color", "breed"])[["weight_kg", "height_cm"]].mean()
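
To see that `.groupby()` reproduces the manual filter-then-summarize approach, here is a sketch on a made-up `dogs` table:

```python
import pandas as pd

# Hypothetical toy data
dogs = pd.DataFrame(
    {
        "color": ["Black", "Black", "Brown", "Brown"],
        "breed": ["Labrador", "Poodle", "Labrador", "Labrador"],
        "weight_kg": [30, 20, 28, 24],
    }
)

# Manual approach: filter one group at a time
black_mean = dogs[dogs["color"] == "Black"]["weight_kg"].mean()

# groupby computes the same statistic for every group in one call
mean_by_color = dogs.groupby("color")["weight_kg"].mean()

# Grouping by two columns yields a Series with a two-level index
mean_by_color_breed = dogs.groupby(["color", "breed"])["weight_kg"].mean()
```

Only combinations that actually occur in the data appear in the grouped result, so there is no Brown/Poodle entry here.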


What percent of sales occurred at each store type?

In [2]:
'''

# Calc total weekly sales
sales_all = sales["weekly_sales"].sum()

# Subset for type A stores, calc total weekly sales
sales_A = sales[sales["type"] == "A"]["weekly_sales"].sum()

# Subset for type B stores, calc total weekly sales
sales_B = sales[sales["type"] == "B"]["weekly_sales"].sum()

# Subset for type C stores, calc total weekly sales
sales_C = sales[sales["type"] == "C"]["weekly_sales"].sum()

# Get proportion for each type
sales_propn_by_type = [sales_A, sales_B, sales_C] / sales_all
print(sales_propn_by_type)

'''


Calculations with .groupby()

In [5]:
'''# Group by type; calc total weekly sales
sales_by_type = sales.groupby("type")["weekly_sales"].sum()

# Get proportion for each type
sales_propn_by_type =  sales_by_type / sum(sales_by_type )
print(sales_propn_by_type)'''


In [7]:
'''# From previous step
sales_by_type = sales.groupby("type")["weekly_sales"].sum()

# Group by type and is_holiday; calc total weekly sales
sales_by_type_is_holiday = sales.groupby(["type","is_holiday"])["weekly_sales"].sum()
print(sales_by_type_is_holiday)'''


#### 2.3.1 Multiple grouped summaries

The .agg() method is useful for computing multiple statistics on multiple variables, and it also works with grouped data. NumPy, which is imported as np, has many different summary statistics functions, including np.min, np.max, np.mean, and np.median.
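
As an alternative sketch, the same statistics can be requested by name as strings, which pandas maps to its own built-in aggregations. Some recent pandas versions emit a FutureWarning when NumPy functions or Python built-ins such as `min` are passed to a grouped `.agg()`, so the string form may be the quieter choice. The `sales_toy` table below is invented for illustration.

```python
import pandas as pd

# Hypothetical toy data
sales_toy = pd.DataFrame(
    {
        "type": ["A", "A", "A", "B", "B"],
        "weekly_sales": [10.0, 20.0, 60.0, 5.0, 15.0],
    }
)

# String names select pandas' built-in aggregations; one output column per statistic
stats = sales_toy.groupby("type")["weekly_sales"].agg(["min", "max", "mean", "median"])
```

The result has one row per group and one column per statistic.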


In [9]:
'''

# Import numpy with the alias np
import numpy as np 

# For each store type, aggregate weekly_sales: get min, max, mean, and median
sales_stats = sales.groupby("type")["weekly_sales"].agg([min,max,np.mean,np.median])

# Print sales_stats
print(sales_stats)

# For each store type, aggregate unemployment and fuel_price_usd_per_l: get min, max, mean, and median
unemp_fuel_stats = sales.groupby("type")[["unemployment","fuel_price_usd_per_l"]].agg([min,max,np.mean,np.median])

# Print unemp_fuel_stats
print(unemp_fuel_stats)

'''


### 2.4 Pivot tables

    Default

        dogs.pivot_table(values = "weight_kg", index = "color")

    Different statistics
    
        dogs.pivot_table(values = "weight_kg", index = "color", aggfunc = np.median)
        
        
    Pivot on two variables
    
        dogs.pivot_table(values = "weight_kg", index = "color", columns = "breed")
    
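    Filling missing values in pivot tables
    
         dogs.pivot_table(values = "weight_kg", index = "color", columns = "breed", fill_value = 0)
    
    Summing with pivot tables (margins = True adds an "All" row and column)
    
         dogs.pivot_table(values = "weight_kg", index = "color", columns = "breed", fill_value = 0, margins = True)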
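
The pivot table options above can be compared side by side on a small table. This is a sketch with made-up data in which one color/breed combination is deliberately missing.

```python
import pandas as pd

# Hypothetical toy data: there is no Brown Poodle
dogs = pd.DataFrame(
    {
        "color": ["Black", "Black", "Brown", "Brown"],
        "breed": ["Labrador", "Poodle", "Labrador", "Labrador"],
        "weight_kg": [30, 20, 28, 24],
    }
)

# Default aggregation is the mean
mean_by_color = dogs.pivot_table(values="weight_kg", index="color")

# Two variables; the missing Brown/Poodle combination is filled with 0
by_color_breed = dogs.pivot_table(
    values="weight_kg", index="color", columns="breed", fill_value=0
)

# margins=True appends an "All" row and an "All" column
with_margins = dogs.pivot_table(
    values="weight_kg", index="color", columns="breed", fill_value=0, margins=True
)
```

The margins use the same aggregation as the rest of the table, so with the default they hold means rather than sums; here the bottom-right "All" cell is the overall mean weight, 25.5. Passing `aggfunc="sum"` would turn them into totals.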

In [11]:
'''

# Pivot for mean weekly_sales for each store type
mean_sales_by_type = sales.pivot_table(values = "weekly_sales", index = "type")

# Print mean_sales_by_type
print(mean_sales_by_type)

'''


In [12]:
'''# Import NumPy as np
import numpy as np

# Pivot for mean and median weekly_sales for each store type
mean_med_sales_by_type = sales.pivot_table(values="weekly_sales", index="type", aggfunc=[np.mean, np.median])

# Print mean_med_sales_by_type
print(mean_med_sales_by_type)'''


In [13]:
'''
# Pivot for mean weekly_sales by store type and holiday 
mean_sales_by_type_holiday = sales.pivot_table(values = "weekly_sales", index = "type", columns = "is_holiday")

# Print mean_sales_by_type_holiday
print(mean_sales_by_type_holiday)

'''


Fill in missing values and sum values with pivot tables

In [15]:
'''# Print mean weekly_sales by department and type; fill missing values with 0
print(sales.pivot_table(values = "weekly_sales", index = "department", columns = "type", fill_value = 0))'''


In [16]:
'''# Print the mean weekly_sales by department and type; fill missing values with 0s; add "All" margins for rows and cols
print(sales.pivot_table(values="weekly_sales", index="department", columns="type", fill_value = 0, margins = True))'''

# 3. Slicing and Indexing DataFrames

    3.1 Setting a column as the index
    
        dogs.set_index("name")

    3.2 Removing an index
    
        dogs.reset_index()
       
    3.3 Dropping an index
    
        dogs.reset_index(drop = True)

    3.4 Indexes make subsetting simpler
    
        dogs[dogs["name"].isin(["Bella", "Stella"])] 
        
        dogs_ind.loc[["Bella", "Stella"]]

    3.5 Index values don't need to be unique
    
        dogs_ind2 = dogs.set_index("breed")

    3.6 Subsetting on duplicated index values
         
        dogs_ind2.loc["Labrador"]

    3.7 Multi-level indexes a.k.a. hierarchical indexes
    
        dogs_ind3 = dogs.set_index(["breed", "color"])

    3.8 Subset the outer level with a list
    
        dogs_ind3.loc[["Labrador", "Chihuahua"]]

    3.9 Subset inner levels with a list of tuples
    
        dogs_ind3.loc[[("Labrador", "Brown"), ("Chihuahua", "Tan")]]

    3.10 Sorting by index values
    
        dogs_ind3.sort_index()

    3.11 Controlling sort_index
    
        dogs_ind3.sort_index(level = ["color", "breed"], ascending = [True, False])
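
The index operations listed above can be chained on one small table. This is a sketch with invented data; `dogs_ind` and `dogs_ind3` follow the naming used in the list.

```python
import pandas as pd

# Hypothetical toy data
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Charlie", "Lucy", "Stella"],
        "breed": ["Labrador", "Poodle", "Chihuahua", "Labrador"],
        "color": ["Brown", "Black", "Tan", "Black"],
        "height_cm": [56, 43, 18, 59],
    }
)

# Single-level index: .loc looks rows up by label
dogs_ind = dogs.set_index("name")
bella_stella = dogs_ind.loc[["Bella", "Stella"]]

# Two-level index: a list of tuples selects (outer, inner) combinations
dogs_ind3 = dogs.set_index(["breed", "color"])
picked = dogs_ind3.loc[[("Labrador", "Brown"), ("Chihuahua", "Tan")]]

# Sort by color ascending, then breed descending
sorted_ind = dogs_ind3.sort_index(level=["color", "breed"], ascending=[True, False])

# reset_index() moves the index back into an ordinary column
restored = dogs_ind.reset_index()
```

None of these calls modify `dogs` in place; each returns a new DataFrame, so the result has to be assigned to a name to be kept.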

    



Pandas allows you to designate columns as an index. This enables cleaner code when taking subsets (as well as providing more efficient lookup under some circumstances).

In [17]:
'''

# Look at temperatures
print(temperatures)

# Set the index of temperatures to city
temperatures_ind = temperatures.set_index("city")

# Look at temperatures_ind
print(temperatures_ind)

# Reset the temperatures_ind index, keeping its contents
print(temperatures_ind.reset_index())

# Reset the temperatures_ind index, dropping its contents
print(temperatures_ind.reset_index(drop = True))

'''


#### 3.1 Subsetting with .loc[]

In [19]:
'''

# Make a list of cities to subset on
cities = ["Moscow", "Saint Petersburg"]

# Subset temperatures using square brackets
print(temperatures[temperatures["city"].isin(cities)])

# Subset temperatures_ind using .loc[]
print(temperatures_ind.loc[cities])

'''


In [21]:
'''# Index temperatures by country & city
temperatures_ind = temperatures.set_index(["country", "city"])

# List of tuples: Brazil, Rio De Janeiro & Pakistan, Lahore
rows_to_keep = [("Brazil", "Rio De Janeiro"), ("Pakistan", "Lahore")]

# Subset for rows to keep
print(temperatures_ind.loc[rows_to_keep])'''


Magnificent multi-level indexing! Multi-level indexes can make it easy to comprehend your dataset when one category is nested inside another category.

#### 3.1.1 Sorting by index values

In [22]:
'''# Sort temperatures_ind by index values
print(temperatures_ind.sort_index())

# Sort temperatures_ind by index values at the city level
print(temperatures_ind.sort_index(level = "city"))

# Sort temperatures_ind by country then descending city
print(temperatures_ind.sort_index(level = ["country", "city"], ascending = [True, False]))'''


### 3.2 Slicing and subsetting with .loc and .iloc

    A. LOC

    3.2.1 Sort the index before you slice
    
        dogs_srt = dogs.set_index(["breed", "color"]).sort_index()
        print(dogs_srt)
    
    3.2.2 Slicing the outer index level (a plain label slice like this does not work on the inner level)
    
        dogs_srt.loc["Chow Chow": "Poodle"] 
    
    3.2.3 Slicing the inner index level 
    
        dogs_srt.loc[("Labrador", "Brown"):("Schnauzer", "Grey")]
        
    3.2.4 Slice twice         
    
        dogs_srt.loc[("Labrador", "Brown"):("Schnauzer", "Grey"), "name":"height_cm"]
        
        
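    3.2.5 Slicing by dates (assumes dogs_srt is indexed by a sorted date column such as date_of_birth, not by breed and color)
    
        dogs_srt.loc["2014-08-25":"2016-09-16"]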
             
    3.2.6 Slicing by partial dates 
    
        dogs_srt.loc["2014":"2016"]
        
    
    B. ILOC
    
    Subsetting by row/column number
    
       

In [24]:
'''# Sort the index of temperatures_ind
temperatures_srt = temperatures_ind.sort_index()

# Subset rows from Pakistan to Russia
print(temperatures_srt.loc["Pakistan":"Russia"])

# Try to subset rows from Lahore to Moscow
print(temperatures_srt.loc["Lahore":"Moscow"])

# Subset rows from Pakistan, Lahore to Russia, Moscow
print(temperatures_srt.loc[("Pakistan", "Lahore"):("Russia","Moscow")])'''

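To see the difference between slicing the outer level and slicing the inner level, here is a small sketch on invented rows (`temps` and `temps_srt` are hypothetical stand-ins for the course data):

```python
import pandas as pd

# Hypothetical sample data standing in for the `temperatures` dataset.
temps = pd.DataFrame(
    {
        "country": ["Russia", "Pakistan", "Poland", "Pakistan", "Russia"],
        "city": ["Moscow", "Lahore", "Warsaw", "Faisalabad", "Saint Petersburg"],
        "avg_temp_c": [5.9, 24.4, 8.9, 24.6, 5.0],
    }
)
temps_srt = temps.set_index(["country", "city"]).sort_index()

# A plain string slice is matched against the outer level (country);
# both endpoints are included.
outer = temps_srt.loc["Pakistan":"Poland"]

# To slice on the inner level, give (outer, inner) tuples for both endpoints.
inner = temps_srt.loc[("Pakistan", "Lahore"):("Russia", "Moscow")]

print(outer)
print(inner)
```

Unlike ordinary Python slices, label slices with `.loc` include the final label, which is why the Poland and Moscow rows appear in the results.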

#### 3.2.4 Slicing in both directions

    Slice rows with code like df.loc[("a", "b"):("c", "d")].
    
    Slice columns with code like df.loc[:, "e":"f"].
    
    Slice both ways with code like df.loc[("a", "b"):("c", "d"), "e":"f"].


In [26]:
'''# Subset rows from India, Hyderabad to Iraq, Baghdad
print(temperatures_srt.loc[("India", "Hyderabad"):("Iraq", "Baghdad")])

# Subset columns from date to avg_temp_c
print(temperatures_srt.loc[:, "date":"avg_temp_c"])

# Subset in both directions at once
print(temperatures_srt.loc[("India", "Hyderabad"):("Iraq", "Baghdad"), "date":"avg_temp_c"])'''

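A compact sketch of slicing rows and columns in one call, using invented values. The `source` column is a hypothetical extra column added only so that the column slice has something to exclude:

```python
import pandas as pd

# Hypothetical sample data; the column names mirror the ones used above.
temps = pd.DataFrame(
    {
        "country": ["India", "India", "Indonesia", "Iran", "Iraq"],
        "city": ["Delhi", "Hyderabad", "Jakarta", "Tehran", "Baghdad"],
        "date": pd.to_datetime(["2000-01-01"] * 5),
        "avg_temp_c": [14.9, 23.8, 26.4, 4.1, 9.5],
        "source": ["a", "b", "c", "d", "e"],  # hypothetical extra column to slice away
    }
)
temps_srt = temps.set_index(["country", "city"]).sort_index()

# Rows: (India, Hyderabad) through (Iraq, Baghdad); columns: date through avg_temp_c.
both = temps_srt.loc[("India", "Hyderabad"):("Iraq", "Baghdad"), "date":"avg_temp_c"]
print(both)
```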

#### 3.2.5 Slicing Time Series

    Slicing is particularly useful for time series, since filtering for data within a date range is a common task. Add the date column to the index, sort it, then use .loc[] to perform the subsetting. Keep your dates in ISO 8601 format: "yyyy-mm-dd" for year-month-day, "yyyy-mm" for year-month, and "yyyy" for year.

In [28]:
'''

# Use Boolean conditions to subset temperatures for rows in 2010 and 2011
temperatures_bool = temperatures[(temperatures["date"] >= "2010-01-01") & (temperatures["date"] <= "2011-12-31")]
print(temperatures_bool)

# Set date as the index and sort the index
temperatures_ind = temperatures.set_index("date").sort_index()

# Use .loc[] to subset temperatures_ind for rows in 2010 and 2011
print(temperatures_ind.loc["2010":"2011"])

# Use .loc[] to subset temperatures_ind for rows from Aug 2010 to Feb 2011
print(temperatures_ind.loc["2010-08":"2011-02"])

'''

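Partial-date slicing can be checked on a handful of rows. The sketch below assumes the date column has been parsed with `pd.to_datetime`; the readings themselves are invented:

```python
import pandas as pd

# Hypothetical monthly readings; "date" is parsed to datetime so that
# partial-string slicing works on the index.
temps = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2011-02-01", "2009-12-01", "2010-08-01", "2011-03-01", "2010-01-01", "2012-01-01"]
        ),
        "avg_temp_c": [2.1, 3.4, 27.9, 8.0, 1.5, 0.7],
    }
)
temps_ind = temps.set_index("date").sort_index()

# Year-level strings select every row from the start of 2010 to the end of 2011.
years = temps_ind.loc["2010":"2011"]

# Month-level strings select August 2010 through the end of February 2011.
months = temps_ind.loc["2010-08":"2011-02"]

print(years)
print(months)
```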

#### 3.2.6 Subsetting by row/column number

In [30]:
'''# Get 23rd row, 2nd column (index 22, 1)
print(temperatures.iloc[22,1])

# Use slicing to get the first 5 rows
print(temperatures.iloc[:5])

# Use slicing to get columns 3 to 4
print(temperatures.iloc[:,2:4])

# Use slicing in both directions at once
print(temperatures.iloc[:5,2:4])'''

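Because `.iloc` works on positions rather than labels, it follows Python's usual slicing rule: the stop position is excluded. A minimal sketch on a made-up frame (`df` is a hypothetical name):

```python
import pandas as pd

# Hypothetical 4x3 frame with easily recognizable values.
df = pd.DataFrame(
    {"a": [0, 1, 2, 3], "b": [10, 11, 12, 13], "c": [20, 21, 22, 23]}
)

# Single cell: third row, second column (positions 2 and 1).
cell = df.iloc[2, 1]

# First two rows; the stop position 2 is excluded.
first_rows = df.iloc[:2]

# Columns at positions 1 and 2, i.e. the second and third columns.
last_cols = df.iloc[:, 1:3]

print(cell)
print(first_rows)
print(last_cols)
```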

### 3.3 Working with Pivot Tables

    3.3.1 - .loc[] + slicing is a power combo
    
        dogs_height_by_breed_vs_color.loc["Chow Chow": "Poodle"]
        
    3.3.2 The axis argument - summary statistics down each column or across each row
        
        dogs_height_by_breed_vs_color.mean(axis = "index")    # default: one mean per column
        
        dogs_height_by_breed_vs_color.mean(axis = "columns")  # one mean per row

#### 3.3.1 Pivot temperature by city and year

    You can access the components of a date (year, month and day) using code of the form dataframe["column"].dt.component, provided the column has a datetime type. For example, the month component is dataframe["column"].dt.month, and the year component is dataframe["column"].dt.year.

In [31]:
'''

# Add a year column to temperatures
temperatures["year"] = temperatures["date"].dt.year

# Pivot avg_temp_c by country and city vs year
temp_by_country_city_vs_year = temperatures.pivot_table(values = "avg_temp_c", index =["country", "city"], columns = "year")

# See the result
print(temp_by_country_city_vs_year)

'''

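The pivot step can be verified by hand on a few rows. This sketch uses invented readings, chosen so that the per-year means are easy to compute mentally:

```python
import pandas as pd

# Hypothetical readings: two per city in 2000 and one per city in 2001.
temps = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2000-01-01", "2000-07-01", "2001-01-01", "2000-01-01", "2000-07-01", "2001-01-01"]
        ),
        "country": ["Egypt"] * 3 + ["India"] * 3,
        "city": ["Cairo"] * 3 + ["Delhi"] * 3,
        "avg_temp_c": [14.0, 28.0, 15.0, 13.0, 31.0, 14.0],
    }
)

# .dt.year only works on a datetime column, hence the pd.to_datetime above.
temps["year"] = temps["date"].dt.year

# pivot_table aggregates with the mean by default.
by_city_vs_year = temps.pivot_table(
    values="avg_temp_c", index=["country", "city"], columns="year"
)
print(by_city_vs_year)
```

Each cell is the mean of the readings for that city and year, so Cairo's 2000 value is the average of 14.0 and 28.0.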

#### 3.3.2 Subsetting pivot tables

    A pivot table is just a DataFrame with sorted indexes, so the techniques you have learned already can be used to subset them. In particular, the .loc[] + slicing combination is often helpful.

In [35]:
'''

# Subset for Egypt to India
temp_by_country_city_vs_year.loc["Egypt":"India"]

# Subset for Egypt, Cairo to India, Delhi
temp_by_country_city_vs_year.loc[("Egypt", "Cairo"):("India","Delhi")]

# Subset for Egypt, Cairo to India, Delhi, and 2005 to 2010
# (the year columns come from .dt.year, so they are integers, not strings)
temp_by_country_city_vs_year.loc[("Egypt", "Cairo"):("India","Delhi"), 2005:2010]

'''

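A small sketch of subsetting a pivot-style table in both directions. The table is built by hand with invented values, and it assumes the year labels are integers, as they would be when the year column is created with `.dt.year`:

```python
import pandas as pd

# Hypothetical pivot-style table: (country, city) rows and integer year columns.
index = pd.MultiIndex.from_tuples(
    [("Egypt", "Alexandria"), ("Egypt", "Cairo"), ("India", "Delhi"), ("Iran", "Tehran")],
    names=["country", "city"],
)
table = pd.DataFrame(
    [[20.0, 20.5, 21.0], [21.5, 22.0, 22.5], [25.0, 25.5, 26.0], [17.0, 17.5, 18.0]],
    index=index,
    columns=pd.Index([2004, 2005, 2006], name="year"),
).sort_index()

# Integer column labels are sliced with integers; both endpoints are included.
subset = table.loc[("Egypt", "Cairo"):("India", "Delhi"), 2005:2006]
print(subset)
```

With `.loc`, `2005:2006` is read as a pair of labels, not positions, so both year columns are returned.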

#### 3.3.3 Calculating on a pivot table

Find the rows or columns where the highest or lowest value occurs.

Recall from Chapter 1 that you can easily subset a Series or DataFrame to find rows of interest using a logical condition inside of square brackets. For example: series[series > value].

In [36]:
'''

# Get the worldwide mean temp by year
mean_temp_by_year = temp_by_country_city_vs_year.mean()

# Filter for the year that had the highest mean temp
print(mean_temp_by_year[mean_temp_by_year == mean_temp_by_year.max()])

# Get the mean temp by city
mean_temp_by_city = temp_by_country_city_vs_year.mean(axis="columns")

# Filter for the city that had the lowest mean temp
print(mean_temp_by_city[mean_temp_by_city == mean_temp_by_city.min()])

'''

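The two `mean()` calls and the Boolean filters can be followed on a tiny table. The numbers below are invented; `idxmax()` and `idxmin()` are shown at the end as an alternative that returns only the label:

```python
import pandas as pd

# Hypothetical pivot-style table: cities as rows, years as columns.
table = pd.DataFrame(
    {2010: [10.0, 20.0, 30.0], 2011: [12.0, 22.0, 32.0], 2012: [11.0, 21.0, 31.0]},
    index=pd.Index(["A", "B", "C"], name="city"),
)

# Default axis ("index"): one mean per column, i.e. per year.
mean_by_year = table.mean()
hottest_year = mean_by_year[mean_by_year == mean_by_year.max()]

# axis="columns": one mean per row, i.e. per city.
mean_by_city = table.mean(axis="columns")
coldest_city = mean_by_city[mean_by_city == mean_by_city.min()]

print(hottest_year)
print(coldest_city)

# idxmax / idxmin return just the label of the extreme value.
print(mean_by_year.idxmax(), mean_by_city.idxmin())
```

The Boolean filter keeps every row tied for the extreme value, while `idxmax()` and `idxmin()` return only the first matching label.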

## 4. Creating and Visualizing DataFrames

### 4.1 Visualizing your data

    import matplotlib.pyplot as plt
    
    4.1.1 Histogram 
        x["var"].hist(bins=5)
        plt.show()
       
    4.1.2 Barplots
    
        avg_weight_by_breed = dog_pack.groupby("breed")["weight_kg"].mean()
        print(avg_weight_by_breed)
        
        avg_weight_by_breed.plot(kind = "bar", title = "   ")
        plt.show()
        
    4.1.3 Lineplot
    
        sully.plot(x = "date", y = "weight_kg", kind = "line", rot = 45)
        plt.show()
        
    
    4.1.4 Scatterplot
    
        sully.plot(x = "date", y = "weight_kg", kind = "scatter")
        plt.show()
        
    
    4.1.5 Layering plots
    
        dog_pack[dog_pack["sex"] == "F"]["height_cm"].hist(alpha = 0.7)
        dog_pack[dog_pack["sex"] == "M"]["height_cm"].hist(alpha = 0.7)
        plt.legend(["F", "M"])
        plt.show()
    
   
   

#### 4.1.1 Which avocado size is most popular?

Bar plots are great for revealing relationships between categorical (size) and numeric (number sold) variables, but you'll often have to manipulate your data first in order to get the numbers you need for plotting.


In [37]:
'''# Import matplotlib.pyplot with alias plt
import matplotlib.pyplot as plt

# Look at the first few rows of data
print(avocados.head())

# Get the total number of avocados sold of each size
nb_sold_by_size = avocados.groupby("size")["nb_sold"].sum()


# Create a bar plot of the number of avocados sold by size
nb_sold_by_size.plot(kind="bar")

# Show the plot
plt.show()'''

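The "manipulate first, then plot" pattern can be sketched on invented sales figures. The `matplotlib.use("Agg")` line is an assumption for running outside a notebook and can be dropped in Jupyter; `sales` is a hypothetical stand-in for the avocados data:

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sales figures, not the real avocados data.
sales = pd.DataFrame(
    {
        "size": ["small", "large", "small", "extra_large", "large"],
        "nb_sold": [100, 80, 120, 10, 90],
    }
)

# Collapse to one number per category before plotting.
nb_sold_by_size = sales.groupby("size")["nb_sold"].sum()

# .plot returns a matplotlib Axes, which can be customized further.
ax = nb_sold_by_size.plot(kind="bar", title="Units sold by size")
ax.set_ylabel("nb_sold")
plt.show()
```

The grouped Series has one entry per size, so the bar plot draws one bar per category.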

#### 4.1.2 Changes in sales over time

Line plots are designed to visualize the relationship between two numeric variables, where each data value is connected to the next one. They are especially useful for visualizing the change in a number over time, since each time point is naturally connected to the next time point. In this exercise, you'll visualize the change in avocado sales over three years.

In [38]:
'''# Import matplotlib.pyplot with alias plt
import matplotlib.pyplot as plt

# Get the total number of avocados sold on each date
nb_sold_by_date = avocados.groupby("date")["nb_sold"].sum()

# Create a line plot of the number of avocados sold by date
nb_sold_by_date.plot(kind = "line")

# Show the plot
plt.show()'''


#### 4.1.3 Avocado supply and demand

Scatter plots are ideal for visualizing relationships between numerical variables. In this exercise, you'll compare the number of avocados sold to average price and see if they're at all related. If they're related, you may be able to use one number to predict the other.

In [40]:
'''# Scatter plot of avg_price vs. nb_sold with title
avocados.plot(x= "nb_sold", y = "avg_price", kind = "scatter", title = "Number of avocados sold vs. average price")

# Show the plot
plt.show()'''


#### 4.1.4 Price of conventional vs. organic avocados

Creating multiple plots for different subsets of data allows you to compare groups. In this exercise, you'll create multiple histograms to compare the prices of conventional and organic avocados.

In [41]:
'''

# Histogram of conventional avg_price with 20 bins
avocados[avocados["type"] == "conventional"]["avg_price"].hist(alpha=0.5, bins = 20)

# Histogram of organic avg_price with 20 bins
avocados[avocados["type"] == "organic"]["avg_price"].hist(alpha=0.5, bins = 20)

# Add a legend
plt.legend(["conventional", "organic"])

# Show the plot
plt.show()

'''


On average, organic avocados are more expensive than conventional ones, but their price distributions have some overlap.

### 4.2 Missing values

    4.2.1 Detecting missing values
    
        dogs.isna() - returns a Boolean for every single value indicating whether it is missing; not very helpful on its own when you're working with a lot of data.

        dogs.isna().any() - returns one value for each column that tells us whether that column contains any missing values.
    
    
    4.2.2 Counting missing values

        dogs.isna().sum()
    
    4.2.3 Plotting missing values
    
        dogs.isna().sum().plot(kind = "bar")
        plt.show()
        
    4.2.4 Removing missing values 
    
        dogs.dropna()

    4.2.5 Replacing missing values

        dogs.fillna(0)


#### 4.2.1 Finding missing values

Missing values are everywhere, and you don't want them interfering with your work. Some functions ignore missing data by default, but that's not always the behavior you might want. Some functions can't handle missing values at all, so these values need to be taken care of before you can use them. If you don't know where your missing values are, or if they exist, you could make mistakes in your analysis. In this exercise, you'll determine if there are missing values in the dataset, and if so, how many.

In [43]:
'''

# Import matplotlib.pyplot with alias plt
import matplotlib.pyplot as plt

# Check individual values for missing values
print(avocados_2016.isna())

# Check each column for missing values
print(avocados_2016.isna().any())

# Bar plot of missing values by variable
avocados_2016.isna().sum().plot(kind = "bar")

# Show plot
plt.show()

'''

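The detection and counting steps can be checked on a frame small enough to read at a glance. This is a sketch with invented values, where `np.nan` marks the gaps:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few gaps; np.nan marks a missing value.
df = pd.DataFrame(
    {
        "date": ["2016-01-03", "2016-01-10", "2016-01-17"],
        "small_sold": [100.0, np.nan, 120.0],
        "large_sold": [np.nan, np.nan, 90.0],
    }
)

# One Boolean per column: does it contain any missing value?
has_missing = df.isna().any()

# One count per column: True counts as 1 when summed.
n_missing = df.isna().sum()

print(has_missing)
print(n_missing)
```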

The output shows missing values in the small_sold, large_sold, and xl_sold columns.

#### 4.2.2 Removing missing values

One way to deal with missing values is to remove them from the dataset completely.


In [45]:
'''# Remove rows with missing values
avocados_complete = avocados_2016.dropna()

# Check if any columns contain missing values
print(avocados_complete.isna().any())'''


Removing observations with missing values is a quick and dirty way to deal with missing data, but this can introduce bias to your data if the values are not missing at random.

#### 4.2.3 Replacing missing values

For numerical variables, one option is to replace missing values with 0, which is what you'll do here. However, when you replace missing values, you make assumptions about what a missing value means. In this case, you will assume that a missing number sold means that no sales for that avocado type were made that week.

In [46]:
'''# List the columns with missing values
cols_with_missing = ["small_sold", "large_sold", "xl_sold"]

# Create histograms showing the distributions cols_with_missing
avocados_2016[cols_with_missing].hist()

# Show the plot
plt.show()'''


In [47]:
'''# From previous step
cols_with_missing = ["small_sold", "large_sold", "xl_sold"]
avocados_2016[cols_with_missing].hist()
plt.show()

# Fill in missing values with 0
avocados_filled = avocados_2016.fillna(0)

# Create histograms of the filled columns
avocados_filled[cols_with_missing].hist()

# Show the plot
plt.show()'''

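Dropping and filling lead to different downstream numbers, which is worth seeing side by side. A minimal sketch on invented data (`df`, `complete` and `filled` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Hypothetical weekly sales with two gaps.
df = pd.DataFrame(
    {
        "date": ["2016-01-03", "2016-01-10", "2016-01-17"],
        "small_sold": [100.0, np.nan, 120.0],
        "large_sold": [80.0, 85.0, np.nan],
    }
)

# Option 1: drop every row that has at least one missing value.
complete = df.dropna()

# Option 2: keep all rows and treat a missing count as zero sales.
filled = df.fillna(0)

# The choice changes summary statistics: the zeros pull the mean down.
print(complete["small_sold"].mean())
print(filled["small_sold"].mean())
```

Only one row survives `dropna()` here, while `fillna(0)` keeps all three rows but lowers the column mean, so the right choice depends on what a missing value actually means in the data.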

### 4.3 Creating DataFrames 

Dictionaries

    my_dict = {"key1": value1}
    my_dict["key1"]
    
    
Creating Dataframes

    From a list of dictionaries: row by row
    
        list_of_dicts = [{"name": "Ginger" , "breed": "Dachshund"},{"name": "Scout" , "breed": "Dalmatian"}]
        new_dogs = pd.DataFrame(list_of_dicts)
        print(new_dogs)

    From a dictionary of lists: column by column 
    
        dict_of_lists = {"name": ["Ginger", "Scout"], "breed": ["Dachshund", "Dalmatian"]}
        new_dogs = pd.DataFrame(dict_of_lists)
    

#### 4.3.1 List of dictionaries

In [51]:
'''# Create a list of dictionaries with new data
avocados_list = [
    {"date": "2019-11-03", "small_sold": 10376832, 'large_sold': 7835071},
    {"date": "2019-11-10", "small_sold": 10717154, 'large_sold': 8561348},
]

# Convert list into DataFrame
avocados_2019 = pd.DataFrame(avocados_list)

# Print the new DataFrame
print(avocados_2019)'''


#### 4.3.2 Dictionary of lists

In [55]:
'''
# Create a dictionary of lists with new data
avocados_dict = {
  "date": ["2019-11-17", "2019-12-01" ],
  "small_sold": [10859987, 9291631],
  "large_sold": [7674135, 6238096]
}

# Convert dictionary into DataFrame
avocados_2019 = pd.DataFrame(avocados_dict)

# Print the new DataFrame
print(avocados_2019)
'''

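The two construction styles describe the same table from different directions. As a quick sketch, building the two-dog example from the notes both ways should give identical DataFrames:

```python
import pandas as pd

# Row by row: each dictionary is one row, keys become column names.
list_of_dicts = [
    {"name": "Ginger", "breed": "Dachshund"},
    {"name": "Scout", "breed": "Dalmatian"},
]
from_rows = pd.DataFrame(list_of_dicts)

# Column by column: each key is a column name, each list holds that column's values.
dict_of_lists = {
    "name": ["Ginger", "Scout"],
    "breed": ["Dachshund", "Dalmatian"],
}
from_columns = pd.DataFrame(dict_of_lists)

# Compare the two results.
print(from_rows.equals(from_columns))
```

The list-of-dictionaries form tends to be convenient when data arrives one record at a time, and the dictionary-of-lists form when whole columns are already available.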

### 4.4 Reading and writing CSVs



CSV, or comma-separated values, is a common data storage file type. It's designed to store tabular data, just like a pandas DataFrame. It's a text file, where each row of data has its own line, and each value is separated by a comma.
    
    CSV to DataFrame
    
        import pandas as pd

        new_dogs = pd.read_csv("new_dogs.csv")

    DataFrame manipulation
    
        new_dogs["bmi"] = new_dogs["weight_kg"] / (new_dogs["height_cm"] / 100) ** 2
    
    DataFrame to CSV
    
        new_dogs.to_csv("new_dogs_with_bmi.csv")

#### 4.4.1 CSV to DataFrame

You work for an airline, and your manager has asked you to do a competitive analysis and see how often passengers flying on other airlines are involuntarily bumped from their flights. You got a CSV file (airline_bumping.csv) from the Department of Transportation containing data on passengers that were involuntarily denied boarding in 2016 and 2017, but it doesn't have the exact numbers you want.

In [57]:
'''

# From previous steps
airline_bumping = pd.read_csv("airline_bumping.csv")
print(airline_bumping.head())
airline_totals = airline_bumping.groupby("airline")[["nb_bumped", "total_passengers"]].sum()
airline_totals["bumps_per_10k"] = airline_totals["nb_bumped"] / airline_totals["total_passengers"] * 10000

# Print airline_totals
print(airline_totals)

'''


#### 4.4.2 DataFrame to CSV  

In [59]:
'''
# Create airline_totals_sorted
airline_totals_sorted = airline_totals.sort_values("bumps_per_10k", ascending = False)

# Print airline_totals_sorted
print(airline_totals_sorted)

# Save as airline_totals_sorted.csv
airline_totals_sorted.to_csv("airline_totals_sorted.csv")

'''

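A write-then-read round trip is a simple way to confirm what `to_csv()` stores. The sketch below uses invented airline totals and a temporary directory; the airline names and the `totals.csv` file name are hypothetical:

```python
import os
import tempfile

import pandas as pd

# Hypothetical totals; "airline" is the index, as it would be after a groupby.
totals = pd.DataFrame(
    {"nb_bumped": [10, 30], "total_passengers": [100000, 200000]},
    index=pd.Index(["AIR A", "AIR B"], name="airline"),
)
totals["bumps_per_10k"] = totals["nb_bumped"] / totals["total_passengers"] * 10000

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "totals.csv")

    # to_csv writes the index as the first column by default.
    totals.to_csv(path)

    # index_col=0 turns that first column back into the index on read.
    restored = pd.read_csv(path, index_col=0)

print(restored)
```

Without `index_col=0`, the airline names would come back as an ordinary column and the DataFrame would get a fresh numeric index.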