# Creating and Visualizing DataFrames

# Visualizing your data

## Which avocado size is most popular?

**Exercise**

_Which avocado size is most popular?_

Avocados are increasingly popular and delicious in guacamole and on toast. The Hass Avocado Board keeps track of avocado supply and demand across the USA, including the sales of three different sizes of avocado. In this exercise, you'll use a bar plot to figure out which size is the most popular.

Bar plots are great for revealing relationships between categorical (size) and numeric (number sold) variables, but you'll often have to manipulate your data first in order to get the numbers you need for plotting.

pandas has been imported as pd, and avocados is available.

**Instructions**

- Print the head of the avocados dataset. What columns are available?

#Import matplotlib.pyplot with alias plt
import matplotlib.pyplot as plt

#Look at the first few rows of data
print(avocados.head())

- For each avocado size group, calculate the total number sold, storing as nb_sold_by_size.

#Get the total number of avocados sold of each size
nb_sold_by_size = avocados.groupby("size")["nb_sold"].sum()

- Create a bar plot of the number of avocados sold by size.

#Create a bar plot of the number of avocados sold by size
nb_sold_by_size.plot(kind="bar", title="Most popular")

- Show the plot.

#Show the plot
plt.show()

![bar](bar.png)

Bedazzling bar plot! 
It looks like small avocados were the most-purchased size, but large avocados were a close second.
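
If you'd rather read the answer off the numbers than off the bars, one option is `idxmax()`, which returns the index label of the largest value. The following is a minimal sketch on a small made-up table; the `toy` rows are hypothetical and are not the Hass Avocado Board data.

```python
import pandas as pd

# Hypothetical rows standing in for the real avocados dataset
toy = pd.DataFrame({
    "size": ["small", "large", "extra_large", "small", "large", "extra_large"],
    "nb_sold": [120, 100, 10, 80, 90, 5],
})

# Total sold per size, sorted so the largest group comes first
nb_sold_by_size = toy.groupby("size")["nb_sold"].sum().sort_values(ascending=False)
print(nb_sold_by_size)

# Index label of the largest total
most_popular = nb_sold_by_size.idxmax()
print(most_popular)
```

Sorting before plotting is optional, but it makes the ranking easier to see in a bar plot.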


## Changes in sales over time

**Exercise**

_Changes in sales over time_

Line plots are designed to visualize the relationship between two numeric variables, where each data value is connected to the next one. They are especially useful for visualizing the change in a number over time since each time point is naturally connected to the next time point. In this exercise, you'll visualize the change in avocado sales over three years.

pandas has been imported as pd, and avocados is available.

**Instructions**

- Get the total number of avocados sold on each date. The DataFrame has two rows for each date—one for organic, and one for conventional. Save this as nb_sold_by_date.

#Import matplotlib.pyplot with alias plt
import matplotlib.pyplot as plt

#Get the total number of avocados sold on each date
nb_sold_by_date = avocados.groupby("date")["nb_sold"].sum()

- Create a line plot of the number of avocados sold.

#Create a line plot of the number of avocados sold by date
nb_sold_by_date.plot(kind="line", title="The number of avocados sold")

- Show the plot.

#Show the plot
plt.show()

Lovely line plot! 
Line plots are great for visualizing something over time. Here, it looks like the number of avocados sold spikes around the same time each year.
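
A line plot only makes sense if the time points are in order. The sketch below uses a small, hypothetical table with two rows per date to show the grouping step, and it assumes the dates arrive as strings, so it parses them with `pd.to_datetime` first. The real `avocados` data may already store dates in a suitable type.

```python
import pandas as pd

# Hypothetical rows: two per date, one organic and one conventional
toy = pd.DataFrame({
    "date": ["2015-01-04", "2015-01-04", "2015-01-11", "2015-01-11"],
    "type": ["conventional", "organic", "conventional", "organic"],
    "nb_sold": [1000, 50, 1200, 70],
})

# Parse the strings so the index is a real, sortable datetime
toy["date"] = pd.to_datetime(toy["date"])

# One total per date, combining both types; groupby sorts the dates by default
nb_sold_by_date = toy.groupby("date")["nb_sold"].sum()
print(nb_sold_by_date)
```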

## Avocado supply and demand

**Exercise**

_Avocado supply and demand_

Scatter plots are ideal for visualizing relationships between numerical variables. In this exercise, you'll compare the number of avocados sold to average price and see if they're at all related. If they're related, you may be able to use one number to predict the other.

matplotlib.pyplot has been imported as plt, pandas has been imported as pd, and avocados is available.

**Instructions**

- Create a scatter plot with nb_sold on the x-axis and avg_price on the y-axis. Title it "Number of avocados sold vs. average price".

#Scatter plot of avg_price vs. nb_sold with title
avocados.plot(x="nb_sold", y="avg_price", kind="scatter", title="Number of avocados sold vs. average price")


- Show the plot.

#Show the plot
plt.show()

Super scatter plot! 
It looks like when more avocados are sold, prices go down. However, this doesn't mean that fewer sales cause higher prices; we can only tell that the two are correlated.
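
To put a number on the relationship seen in a scatter plot, one option is the Pearson correlation coefficient, which `Series.corr` computes by default. This is a rough sketch on hypothetical values, not the course data. A negative result matches a downward-sloping cloud of points, but it says nothing about cause.

```python
import pandas as pd

# Hypothetical values: price falls as the number sold rises
toy = pd.DataFrame({
    "nb_sold": [100, 200, 300, 400, 500],
    "avg_price": [1.9, 1.6, 1.5, 1.2, 1.0],
})

# Pearson correlation between the two columns (a value between -1 and 1)
r = toy["nb_sold"].corr(toy["avg_price"])
print(round(r, 2))
```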

## Price of conventional vs. organic avocados

**Exercise**

_Price of conventional vs. organic avocados_

Creating multiple plots for different subsets of data allows you to compare groups. In this exercise, you'll create multiple histograms to compare the prices of conventional and organic avocados.

matplotlib.pyplot has been imported as plt and pandas has been imported as pd.

**Instructions 1/3**

- Subset avocados for the conventional type, and the average price column. Create a histogram.

#Histogram of conventional avg_price
avocados[avocados["type"] == "conventional"]["avg_price"].hist()

- Create a histogram of avg_price for organic type avocados.

#Histogram of organic avg_price
avocados[avocados["type"] == "organic"]["avg_price"].hist()

- Add a legend to your plot, with the names "conventional" and "organic".

#Add a legend
plt.legend(["conventional" , "organic"])

- Show your plot.

#Show the plot
plt.show()

**Instructions 2/3**

- Modify your code to adjust the transparency of both histograms to 0.5 to see how much overlap there is between the two distributions.

#Modify histogram transparency to 0.5 
avocados[avocados["type"] == "conventional"]["avg_price"].hist(alpha=0.5)

#Modify histogram transparency to 0.5
avocados[avocados["type"] == "organic"]["avg_price"].hist(alpha=0.5)

#Add a legend
plt.legend(["conventional", "organic"])

#Show the plot
plt.show()

**Instructions 3/3**

- Modify your code to use 20 bins in both histograms.

#Modify bins to 20
avocados[avocados["type"] == "conventional"]["avg_price"].hist(alpha=0.5, bins=20)

#Modify bins to 20
avocados[avocados["type"] == "organic"]["avg_price"].hist(alpha=0.5, bins=20)

#Add a legend
plt.legend(["conventional", "organic"])

#Show the plot
plt.show()

Great layering! 
We can see that on average, organic avocados are more expensive than conventional ones, but their price distributions have some overlap.
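
The same comparison can be checked numerically with a grouped summary. The sketch below uses hypothetical prices; the column names mirror the exercise, but the values are made up.

```python
import pandas as pd

# Hypothetical prices for the two avocado types
toy = pd.DataFrame({
    "type": ["conventional", "conventional", "conventional",
             "organic", "organic", "organic"],
    "avg_price": [1.0, 1.2, 1.5, 1.4, 1.7, 1.9],
})

# Mean, minimum and maximum price per type
summary = toy.groupby("type")["avg_price"].agg(["mean", "min", "max"])
print(summary)

# The ranges overlap when the cheapest organic price is below
# the most expensive conventional price
overlap = summary.loc["organic", "min"] < summary.loc["conventional", "max"]
print(overlap)
```

A summary table like this complements the histograms: the means show which group is more expensive on average, while the min and max hint at how much the distributions overlap.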

# Missing values

## Finding missing values

**Exercise**

_Finding missing values_

Missing values are everywhere, and you don't want them interfering with your work. Some functions ignore missing data by default, but that's not always the behavior you might want. Some functions can't handle missing values at all, so these values need to be taken care of before you can use them. If you don't know where your missing values are, or if they exist, you could make mistakes in your analysis. In this exercise, you'll determine if there are missing values in the dataset, and if so, how many.

pandas has been imported as pd and avocados_2016, a subset of avocados that contains only sales from 2016, is available.

**Instructions**

- Print a DataFrame that shows whether each value in avocados_2016 is missing or not.

#Import matplotlib.pyplot with alias plt
import matplotlib.pyplot as plt

#Check individual values for missing values
print(avocados_2016.isna())

<script.py> output:

         date  avg_price  total_sold  small_sold  large_sold  xl_sold  total_bags_sold  small_bags_sold  large_bags_sold  xl_bags_sold
    0   False      False       False       False       False    False            False            False            False         False
    1   False      False       False       False       False    False            False            False            False         False
    2   False      False       False       False        True    False            False            False            False         False
    3   False      False       False       False       False    False            False            False            False         False
    4   False      False       False       False       False     True            False            False            False         False
    5   False      False       False        True       False    False            False            False            False         False
    6   False      False       False       False       False    False            False            False            False         False
    7   False      False       False       False        True    False            False            False            False         False
    8   False      False       False       False       False    False            False            False            False         False
    9   False      False       False       False       False    False            False            False            False         False
    10  False      False       False       False        True    False            False            False            False         False
    11  False      False       False       False       False    False            False            False            False         False
    12  False      False       False       False       False    False            False            False            False         False
    13  False      False       False       False       False    False            False            False            False         False
    14  False      False       False       False       False    False            False            False            False         False
    15  False      False       False       False        True    False            False            False            False         False
    16  False      False       False       False       False     True            False            False            False         False
    17  False      False       False       False       False    False            False            False            False         False
    18  False      False       False       False       False    False            False            False            False         False
    19  False      False       False       False        True    False            False            False            False         False
    20  False      False       False       False       False    False            False            False            False         False
    21  False      False       False       False       False    False            False            False            False         False
    22  False      False       False       False       False    False            False            False            False         False
    23  False      False       False       False       False    False            False            False            False         False
    24  False      False       False       False       False    False            False            False            False         False
    25  False      False       False       False       False    False            False            False            False         False
    26  False      False       False       False       False    False            False            False            False         False
    27  False      False       False       False       False    False            False            False            False         False
    28  False      False       False       False       False    False            False            False            False         False
    29  False      False       False       False       False    False            False            False            False         False
    30  False      False       False       False       False     True            False            False            False         False
    31  False      False       False       False       False    False            False            False            False         False
    32  False      False       False       False       False     True            False            False            False         False
    33  False      False       False       False       False    False            False            False            False         False
    34  False      False       False       False       False    False            False            False            False         False
    35  False      False       False       False       False    False            False            False            False         False
    36  False      False       False        True       False    False            False            False            False         False
    37  False      False       False       False        True    False            False            False            False         False
    38  False      False       False       False       False    False            False            False            False         False
    39  False      False       False       False       False    False            False            False            False         False
    40  False      False       False        True       False    False            False            False            False         False
    41  False      False       False       False       False    False            False            False            False         False
    42  False      False       False       False       False    False            False            False            False         False
    43  False      False       False       False       False    False            False            False            False         False
    44  False      False       False        True       False    False            False            False            False         False
    45  False      False       False       False       False    False            False            False            False         False
    46  False      False       False       False       False    False            False            False            False         False
    47  False      False       False       False       False    False            False            False            False         False
    48  False      False       False       False       False    False            False            False            False         False
    49  False      False       False       False       False    False            False            False            False         False
    50  False      False       False        True       False    False            False            False            False         False
    51  False      False       False        True       False    False            False            False            False         False

- Print a summary that shows whether any value in each column is missing or not.

#Check each column for missing values
print(avocados_2016.isna().any())

<script.py> output:

    date               False
    avg_price          False
    total_sold         False
    small_sold          True
    large_sold          True
    xl_sold             True
    total_bags_sold    False
    small_bags_sold    False
    large_bags_sold    False
    xl_bags_sold       False
    dtype: bool

- Create a bar plot of the total number of missing values in each column.

#Bar plot of missing values by variable
avocados_2016.isna().sum().plot(kind="bar")

#Show plot
plt.show()
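
The three checks used in this exercise (`isna()`, `isna().any()`, and `isna().sum()`) can be tried on a tiny frame where the answers are easy to verify by eye. This is a sketch with hypothetical values; only the column names are borrowed from the output above.

```python
import pandas as pd

nan = float("nan")

# Hypothetical data with a few deliberately missing entries
toy = pd.DataFrame({
    "small_sold": [10.0, nan, 30.0, nan],
    "large_sold": [5.0, 6.0, nan, 8.0],
    "total_sold": [15.0, 20.0, 40.0, 30.0],
})

# True/False per cell
print(toy.isna())

# True/False per column: does it contain any missing value?
print(toy.isna().any())

# Count of missing values per column (True counts as 1)
missing_counts = toy.isna().sum()
print(missing_counts)
```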


## Removing missing values

**Exercise**

_Removing missing values_

Now that you know there are some missing values in your DataFrame, you have a few options to deal with them. One way is to remove them from the dataset completely. In this exercise, you'll remove missing values by removing all rows that contain missing values.

pandas has been imported as pd and avocados_2016 is available.

**Instructions**

- Remove the rows of avocados_2016 that contain missing values and store the remaining rows in avocados_complete.

#Remove rows with missing values
avocados_complete = avocados_2016.dropna()


- Verify that all missing values have been removed from avocados_complete. Check each column for NAs and print the result.

#Check if any columns contain missing values
print(avocados_complete.isna().any())

<script.py> output:

    date               False
    avg_price          False
    total_sold         False
    small_sold         False
    large_sold         False
    xl_sold            False
    total_bags_sold    False
    small_bags_sold    False
    large_bags_sold    False
    xl_bags_sold       False
    dtype: bool
    

Delightful dropping! 
Removing observations with missing values is a quick and dirty way to deal with missing data, but this can introduce bias to your data if the values are not missing at random.
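
Before dropping rows, it can help to count how many you would lose. The sketch below uses hypothetical values and also shows the `subset` argument of `dropna`, which limits the check to chosen columns. Whether that is appropriate depends on your analysis.

```python
import pandas as pd

nan = float("nan")

# Hypothetical data: three of the four rows have at least one missing value
toy = pd.DataFrame({
    "small_sold": [10.0, nan, 30.0, nan],
    "large_sold": [5.0, 6.0, nan, 8.0],
    "total_sold": [15.0, 20.0, 40.0, 30.0],
})

# Drop every row that has a missing value in any column
complete = toy.dropna()
rows_dropped = len(toy) - len(complete)
print(rows_dropped)

# Drop only the rows where small_sold is missing
partial = toy.dropna(subset=["small_sold"])
print(len(partial))
```

Here a plain `dropna()` keeps a single row, which illustrates how quickly the data can shrink when missing values are spread across columns.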

## Replacing missing values

**Exercise**

_Replacing missing values_

Another way of handling missing values is to replace them _all with the same value_. For numerical variables, one option is to replace missing values with 0, which is what you'll do here. 
However, when you replace missing values, you make assumptions about what a missing value means. 
In this case, you will assume that a missing number sold means that no sales for that avocado type were made that week.

In this exercise, you'll see how replacing missing values can affect the distribution of a variable using histograms. You can plot histograms for multiple variables at a time as follows:

dogs[["height_cm", "weight_kg"]].hist()

pandas has been imported as pd and matplotlib.pyplot has been imported as plt. The avocados_2016 dataset is available.

**Instructions 1/2**

- A list has been created, cols_with_missing, containing the names of columns with missing values: "small_sold", "large_sold", and "xl_sold".

#List the columns with missing values
cols_with_missing = ["small_sold", "large_sold", "xl_sold"]

- Create a histogram of those columns.

#Create histograms showing the distributions of cols_with_missing
avocados_2016[cols_with_missing].hist()

- Show the plot.

#Show the plot
plt.show()


**Instructions 2/2**

- Replace the missing values of avocados_2016 with 0s and store the result as avocados_filled.

#From previous step
cols_with_missing = ["small_sold", "large_sold", "xl_sold"]

avocados_2016[cols_with_missing].hist()

plt.show()

#Fill in missing values with 0
avocados_filled = avocados_2016.fillna(0)

- Create a histogram of the cols_with_missing columns of avocados_filled.

#Create histograms of the filled columns
avocados_filled[cols_with_missing].hist()

#Show the plot
plt.show()

Fabulous filling! 
Notice how the distribution has changed shape after replacing missing values with zeros.
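
The shift is visible in summary statistics too. In this sketch, with hypothetical values, the mean halves after filling with zeros because `mean()` skips missing values by default but counts the zeros.

```python
import pandas as pd

nan = float("nan")

# Hypothetical column with two missing weeks
small_sold = pd.Series([10.0, nan, 30.0, nan], name="small_sold")

# Missing values are ignored: (10 + 30) / 2
mean_skipping_missing = small_sold.mean()

# Zeros are counted: (10 + 0 + 30 + 0) / 4
mean_after_fill = small_sold.fillna(0).mean()

print(mean_skipping_missing, mean_after_fill)
```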


# Creating DataFrames

## List of dictionaries

**Exercise**

_List of dictionaries_

You recently got some new avocado data from 2019 that you'd like to put in a DataFrame using the list of dictionaries method. Remember that with this method, you go through the data row by row.

| date | small_sold | large_sold |
|---|---|---|
| "2019-11-03" | 10376832 | 7835071 |
| "2019-11-10" | 10717154 | 8561348 |

pandas as pd is imported.

**Instructions**

- Create a list of dictionaries with the new data called avocados_list.

#Create a list of dictionaries with new data

avocados_list = [
    {"date": "2019-11-03", "small_sold": 10376832, "large_sold": 7835071},
    {"date": "2019-11-10", "small_sold": 10717154, "large_sold": 8561348},
]

- Convert the list into a DataFrame called avocados_2019.

#Convert list into DataFrame

avocados_2019 = pd.DataFrame(avocados_list)


- Print your new DataFrame.

#Print the new DataFrame
print(avocados_2019)

<script.py> output:

             date  small_sold  large_sold
    0  2019-11-03    10376832     7835071
    1  2019-11-10    10717154     8561348
 
Lovely work with the list-of-dictionaries! 
The list-of-dictionaries method creates DataFrames row-by-row.

## Dictionary of lists

**Exercise**

_Dictionary of lists_

Some more data just came in! This time, you'll use the dictionary of lists method, parsing the data column by column.

| date | small_sold | large_sold |
|---|---|---|
| "2019-11-17" | 10859987 | 7674135 |
| "2019-12-01" | 9291631 | 6238096 |

pandas as pd is imported.

**Instructions**

- Create a dictionary of lists with the new data called avocados_dict.

#Create a dictionary of lists with new data
avocados_dict = {
  "date": ["2019-11-17", "2019-12-01"],
  "small_sold": [10859987, 9291631],
  "large_sold": [7674135,6238096]
}

- Convert the dictionary to a DataFrame called avocados_2019.

#Convert dictionary into DataFrame
avocados_2019 = pd.DataFrame(avocados_dict)

- Print your new DataFrame

#Print the new DataFrame
print(avocados_2019)

<script.py> output:

             date  small_sold  large_sold
    0  2019-11-17    10859987     7674135
    1  2019-12-01     9291631     6238096

Delightful dictionary-of-lists usage! 
The dictionary-of-lists method creates DataFrames column-by-column.
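
Both construction methods describe the same table, just sliced in different directions. As a quick check, the sketch below builds the two rows from this exercise both ways and compares the results with `DataFrame.equals`. The variable names `rows`, `columns`, `by_row`, and `by_column` are only for illustration.

```python
import pandas as pd

# Row by row: one dictionary per row
rows = [
    {"date": "2019-11-17", "small_sold": 10859987, "large_sold": 7674135},
    {"date": "2019-12-01", "small_sold": 9291631, "large_sold": 6238096},
]

# Column by column: one list per column
columns = {
    "date": ["2019-11-17", "2019-12-01"],
    "small_sold": [10859987, 9291631],
    "large_sold": [7674135, 6238096],
}

by_row = pd.DataFrame(rows)
by_column = pd.DataFrame(columns)

# True when shape, column order, dtypes and values all match
same = by_row.equals(by_column)
print(same)
```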

## CSV to DataFrame

**Exercise**

_CSV to DataFrame_

You work for an airline, and your manager has asked you to do a competitive analysis and see how often passengers flying on other airlines are involuntarily bumped from their flights. You got a CSV file (airline_bumping.csv) from the Department of Transportation containing data on passengers that were involuntarily denied boarding in 2016 and 2017, but it doesn't have the exact numbers you want. In order to figure this out, you'll need to get the CSV into a pandas DataFrame and do some manipulation!

pandas is imported for you as pd. "airline_bumping.csv" is in your working directory.

**Instructions 1/4**

- Read the CSV file "airline_bumping.csv" and store it as a DataFrame called airline_bumping.

#Read CSV as DataFrame called airline_bumping
airline_bumping = pd.read_csv("airline_bumping.csv")

- Print the first few rows of airline_bumping

#Take a look at the DataFrame
print(airline_bumping.head())

<script.py> output:

                 airline  year  nb_bumped  total_passengers
    0    DELTA AIR LINES  2017        679          99796155
    1     VIRGIN AMERICA  2017        165           6090029
    2    JETBLUE AIRWAYS  2017       1475          27255038
    3    UNITED AIRLINES  2017       2067          70030765
    4  HAWAIIAN AIRLINES  2017         92           8422734
    

**Instructions 2/4**

- For each airline group, select the nb_bumped and total_passengers columns, and calculate the sum (for both years). Store this as airline_totals.

#  From previous step
airline_bumping = pd.read_csv("airline_bumping.csv")
print(airline_bumping.head())

#For each airline, select nb_bumped and total_passengers and sum
airline_totals = airline_bumping.groupby("airline")[["nb_bumped", "total_passengers"]].sum()

<script.py> output:

                 airline  year  nb_bumped  total_passengers
    0    DELTA AIR LINES  2017        679          99796155
    1     VIRGIN AMERICA  2017        165           6090029
    2    JETBLUE AIRWAYS  2017       1475          27255038
    3    UNITED AIRLINES  2017       2067          70030765
    4  HAWAIIAN AIRLINES  2017         92           8422734

**Instructions 3/4**

- Create a new column of airline_totals called bumps_per_10k, which is the number of passengers bumped per 10,000 passengers in 2016 and 2017.

#From previous steps
airline_bumping = pd.read_csv("airline_bumping.csv")
print(airline_bumping.head())
airline_totals = airline_bumping.groupby("airline")[["nb_bumped", "total_passengers"]].sum()

#Create new col, bumps_per_10k: no. of bumps per 10k passengers for each airline
airline_totals["bumps_per_10k"] = airline_totals["nb_bumped"] / airline_totals["total_passengers"] * 10000

<script.py> output:

                 airline  year  nb_bumped  total_passengers
    0    DELTA AIR LINES  2017        679          99796155
    1     VIRGIN AMERICA  2017        165           6090029
    2    JETBLUE AIRWAYS  2017       1475          27255038
    3    UNITED AIRLINES  2017       2067          70030765
    4  HAWAIIAN AIRLINES  2017         92           8422734

**Instructions 4/4**

- Print airline_totals to see the results of your manipulations.

#From previous steps
airline_bumping = pd.read_csv("airline_bumping.csv")
print(airline_bumping.head())
airline_totals = airline_bumping.groupby("airline")[["nb_bumped", "total_passengers"]].sum()
airline_totals["bumps_per_10k"] = airline_totals["nb_bumped"] / airline_totals["total_passengers"] * 10000

#Print airline_totals
print(airline_totals)

<script.py> output:

                         nb_bumped  total_passengers  bumps_per_10k
    airline                                                        
    ALASKA AIRLINES           1392          36543121          0.381
    AMERICAN AIRLINES        11115         197365225          0.563
    DELTA AIR LINES           1591         197033215          0.081
    EXPRESSJET AIRLINES       3326          27858678          1.194
    FRONTIER AIRLINES         1228          22954995          0.535
    HAWAIIAN AIRLINES          122          16577572          0.074
    JETBLUE AIRWAYS           3615          53245866          0.679
    SKYWEST AIRLINES          3094          47091737          0.657
    SOUTHWEST AIRLINES       18585         228142036          0.815
    SPIRIT AIRLINES           2920          32304571          0.904
    UNITED AIRLINES           4941         134468897          0.367
    VIRGIN AMERICA             242          12017967          0.201

Masterful manipulation! 
Now you'll need to export this so you can share it with others.
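
If you want to try the read, group, and compute steps without the real file, `pd.read_csv` also accepts a file-like object. The sketch below feeds it hypothetical CSV text through `io.StringIO`; the airline names and numbers are invented, and only the column names follow `airline_bumping.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for airline_bumping.csv
csv_text = """airline,year,nb_bumped,total_passengers
AIRLINE A,2016,100,1000000
AIRLINE A,2017,50,500000
AIRLINE B,2016,30,2000000
AIRLINE B,2017,10,2000000
"""

# read_csv accepts a path or any file-like object
bumping = pd.read_csv(io.StringIO(csv_text))

# Sum both years per airline, then compute the rate from the totals
totals = bumping.groupby("airline")[["nb_bumped", "total_passengers"]].sum()
totals["bumps_per_10k"] = totals["nb_bumped"] / totals["total_passengers"] * 10000
print(totals)
```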

## DataFrame to CSV

**Exercise**

_DataFrame to CSV_

You're almost there! To make things easier to read, you'll need to sort the data and export it to CSV so that your colleagues can read it.

pandas as pd has been imported for you.

**Instructions**

- Sort airline_totals by the values of bumps_per_10k from highest to lowest, storing as airline_totals_sorted.

#Create airline_totals_sorted
airline_totals_sorted = airline_totals.sort_values(by="bumps_per_10k", ascending=False)

- Print your sorted DataFrame.

#Print airline_totals_sorted
print(airline_totals_sorted)

- Save the sorted DataFrame as a CSV called "airline_totals_sorted.csv".

#Save as airline_totals_sorted.csv
airline_totals_sorted.to_csv("airline_totals_sorted.csv")

<script.py> output:

                         nb_bumped  total_passengers  bumps_per_10k
    airline                                                        
    EXPRESSJET AIRLINES       3326          27858678          1.194
    SPIRIT AIRLINES           2920          32304571          0.904
    SOUTHWEST AIRLINES       18585         228142036          0.815
    JETBLUE AIRWAYS           3615          53245866          0.679
    SKYWEST AIRLINES          3094          47091737          0.657
    AMERICAN AIRLINES        11115         197365225          0.563
    FRONTIER AIRLINES         1228          22954995          0.535
    ALASKA AIRLINES           1392          36543121          0.381
    UNITED AIRLINES           4941         134468897          0.367
    VIRGIN AMERICA             242          12017967          0.201
    DELTA AIR LINES           1591         197033215          0.081
    HAWAIIAN AIRLINES          122          16577572          0.074
    
Excellent exporting! 
Now you can share these insights about your competitors with your team.
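
One detail worth knowing when sharing the file: `to_csv` writes the index as the first column, so a colleague reading it back may want to pass `index_col` to restore the airline labels. The round trip below is a sketch on hypothetical values. It relies on `to_csv` returning the CSV text when no path is given, so nothing is written to disk.

```python
import io

import pandas as pd

# Hypothetical sorted totals with the airline name as the index
totals_sorted = pd.DataFrame(
    {"nb_bumped": [150, 40], "bumps_per_10k": [1.0, 0.1]},
    index=pd.Index(["AIRLINE A", "AIRLINE B"], name="airline"),
)

# With no path, to_csv returns the CSV text instead of writing a file
csv_text = totals_sorted.to_csv()
print(csv_text)

# index_col restores the airline labels as the index when reading back
restored = pd.read_csv(io.StringIO(csv_text), index_col="airline")
print(restored)
```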