# Statistical Methods in Pandas - Lab

## Introduction

In this lab, you'll get hands-on experience with some of the key summary statistics methods in Pandas.

## Objectives
You will be able to:

* Understand and use the df.describe() and df.info() summary statistics methods
* Use built-in Pandas methods for calculating summary statistics (.mean(), .std(), .count(), .sum(), .median(), and .quantile())
* Apply a function to every element in a Series or DataFrame using s.apply() and df.applymap()


## Getting Started

For this lab, we'll be working with a dataset containing information on various LEGO sets.  You will find this dataset in the file `lego_sets.csv`.  

In the cell below:

* Import pandas and set the standard alias of `pd`
* Load in the `lego_sets.csv` dataset using the `read_csv()` function
* Display the head of the DataFrame to get a feel for what we'll be working with

In [21]:
# Your code here
import pandas as pd
df = pd.read_csv('lego_sets.csv')

In [2]:
df.head()

|  | ages | list_price | num_reviews | piece_count | play_star_rating | prod_desc | prod_id | prod_long_desc | review_difficulty | set_name | star_rating | theme_name | val_star_rating | country |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 6-12 | 29.99 | 2.0 | 277.0 | 4.0 | Catapult into action and take back the eggs fr... | 75823.0 | Use the staircase catapult to launch Red into ... | Average | Bird Island Egg Heist | 4.5 | Angry Birds™ | 4.0 | US |
| 1 | 6-12 | 19.99 | 2.0 | 168.0 | 4.0 | Launch a flying attack and rescue the eggs fro... | 75822.0 | Pilot Pig has taken off from Bird Island with ... | Easy | Piggy Plane Attack | 5.0 | Angry Birds™ | 4.0 | US |
| 2 | 6-12 | 12.99 | 11.0 | 74.0 | 4.3 | Chase the piggy with lightning-fast Chuck and ... | 75821.0 | Pitch speedy bird Chuck against the Piggy Car.... | Easy | Piggy Car Escape | 4.3 | Angry Birds™ | 4.1 | US |
| 3 | 12+ | 99.99 | 23.0 | 1032.0 | 3.6 | Explore the architecture of the United States ... | 21030.0 | Discover the architectural secrets of the icon... | Average | United States Capitol Building | 4.6 | Architecture | 4.3 | US |
| 4 | 12+ | 79.99 | 14.0 | 744.0 | 3.2 | Recreate the Solomon R. Guggenheim Museum® wit... | 21035.0 | Discover the architectural secrets of Frank Ll... | Challenging | Solomon R. Guggenheim Museum® | 4.6 | Architecture | 4.1 | US |


## Getting DataFrame-Level Statistics

We'll begin by getting some overall summary statistics on the dataset.  There are two ways we'll get this information-- `.info()` and `.describe()`.

### Using `.info()`

The `.info()` method provides us with metadata on the DataFrame itself.  This allows us to answer questions such as:

* What data type does each column contain?
* How many rows are in my dataset? 
* How many total non-missing values does each column contain?
* How much memory does the DataFrame take up?
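
As a rough illustration of the questions above, here is a minimal sketch on a small made-up DataFrame (the `toy` data is hypothetical and is not taken from `lego_sets.csv`). It captures the `.info()` report as text and then pulls out the row count and non-missing counts directly.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "set_name": ["A", "B", "C", "D"],
    "piece_count": [277.0, 168.0, np.nan, 74.0],
    "review_difficulty": ["Easy", None, "Average", None],
})

# .info() prints its report; passing buf= captures it as text instead
buffer = io.StringIO()
toy.info(buf=buffer)
report = buffer.getvalue()
print(report)

# The same facts are also available programmatically
n_rows = len(toy)                    # number of rows
non_null_counts = toy.notna().sum()  # non-missing values per column
print(n_rows)
print(non_null_counts)
```

Comparing each column's non-missing count with the total number of rows is a quick way to spot which columns contain missing values.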

In the cell below, call our DataFrame's `.info()` method. 

In [3]:
# Your code here
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12261 entries, 0 to 12260
Data columns (total 14 columns):
ages                 12261 non-null object
list_price           12261 non-null float64
num_reviews          10641 non-null float64
piece_count          12261 non-null float64
play_star_rating     10486 non-null float64
prod_desc            11884 non-null object
prod_id              12261 non-null float64
prod_long_desc       12261 non-null object
review_difficulty    10206 non-null object
set_name             12261 non-null object
star_rating          10641 non-null float64
theme_name           12258 non-null object
val_star_rating      10466 non-null float64
country              12261 non-null object
dtypes: float64(7), object(7)
memory usage: 1.3+ MB


#### Interpreting the Results

Read the output above, and then answer the following questions:

How many total rows are in this DataFrame?  How many columns contain numeric data? How many contain categorical data?  Identify at least 3 columns that contain missing values. 

Write your answer below this line:
________________________________________________________________________________________________________________________________

- Total rows in this DataFrame: 12,261
- Columns that contain numeric data: 7
- Columns that contain categorical data: 7
- Columns with missing data: 
    - num_reviews
    - play_star_rating 
    - prod_desc
    - review_difficulty
    - star_rating
    - val_star_rating

### Using `.describe()`

Whereas `.info()` provides statistics about the DataFrame itself, `.describe()` returns output containing basic summary statistics about the data contained within the DataFrame.  
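
To make this concrete, here is a minimal sketch using a small made-up DataFrame (the `toy` data is hypothetical). It shows that `.describe()` summarizes only the numeric columns by default, and that a single statistic can be looked up from the result by row and column label.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "piece_count": [10.0, 20.0, 30.0, 40.0],
    "theme_name": ["City", "City", "DUPLO", "Juniors"],
})

# By default, only numeric columns are summarized
summary = toy.describe()
print(summary)

# The result is itself a DataFrame: rows are statistics, columns are variables
median_pieces = summary.loc["50%", "piece_count"]
largest = summary.loc["max", "piece_count"]
print(median_pieces, largest)
```

Note that the median appears in the `50%` row, since the median is the 50th percentile.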

In the cell below, call the DataFrame's `.describe()` method. 

In [4]:
# Your code here
df.describe()

|  | list_price | num_reviews | piece_count | play_star_rating | prod_id | star_rating | val_star_rating |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 12261.0 | 10641.0 | 12261.0 | 10486.0 | 12261.0 | 10641.0 | 10466.0 |
| mean | 65.141998 | 16.826238 | 493.405921 | 4.337641 | 59836.75 | 4.514134 | 4.22896 |
| std | 91.980429 | 36.368984 | 825.36458 | 0.652051 | 163811.5 | 0.518865 | 0.660282 |
| min | 2.2724 | 1.0 | 1.0 | 1.0 | 630.0 | 1.8 | 1.0 |
| 25% | 19.99 | 2.0 | 97.0 | 4.0 | 21034.0 | 4.3 | 4.0 |
| 50% | 36.5878 | 6.0 | 216.0 | 4.5 | 42069.0 | 4.7 | 4.3 |
| 75% | 70.1922 | 13.0 | 544.0 | 4.8 | 70922.0 | 5.0 | 4.7 |
| max | 1104.87 | 367.0 | 7541.0 | 5.0 | 2000431.0 | 5.0 | 5.0 |


#### Interpreting the Results

The output contains descriptive statistics corresponding to the columns.  Use these to answer the following questions:

What is the standard deviation of the piece count?  How many pieces are in the largest LEGO set?  How many are in the smallest LEGO set?  What is the median `val_star_rating`?

________________________________________________________________________________________________________________________________

- Standard deviation for piece count: 825 pieces
- Number of pieces in the largest lego set: 7,541
- Number of pieces in the smallest lego set: 1
- Median val_star_rating: 4.3

## Getting Summary Statistics

Pandas also allows us to easily compute individual summary statistics using built-in methods.  Next, we'll get some practice using these methods. 
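
As a quick warm-up, the sketch below runs several of these methods on a small made-up Series (the `ratings` values are hypothetical). It is only meant to show how each method treats missing values; the real exercises follow.

```python
import pandas as pd

# Hypothetical ratings with one missing value, invented for illustration
ratings = pd.Series([4.0, 4.5, None, 5.0, 3.5], name="star_rating")

print(ratings.mean())    # 4.25, missing values are skipped
print(ratings.median())  # 4.25
print(ratings.sum())     # 17.0
print(ratings.std())     # sample standard deviation (ddof=1 by default)
print(ratings.count())   # 4, counts only non-missing values
print(len(ratings))      # 5, counts every row, missing or not
```

The difference between `.count()` and `len()` is worth remembering: `.count()` ignores missing values, so it can be smaller than the number of rows.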

In the cell below, compute the median value of the `star_rating` column.

In [7]:
# Your code here
df['star_rating'].median()

4.7

Next, get a count of the total number of values in `play_star_rating`.

In [8]:
# Your code here
df['play_star_rating'].count()

10486

Now, compute the standard deviation of the `list_price` column.

In [9]:
# Your code here
df['list_price'].std()

91.9804293059243

If we bought every single lego set in this dataset, how many pieces would we have?  

> **Note**: If you truly want to answer this accurately, and are up for the challenge, try to remove duplicate LEGO set entries before summing the pieces. Many of the sets are listed multiple times in the dataset above, depending on the country where they are sold and other parameters. If you're stuck, just practice calculating the total number of pieces in the dataset for now.
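
To see why duplicates matter, here is a minimal sketch on a made-up table of listings (the rows in `listings` are hypothetical, and de-duplicating on `set_name` alone is an assumption made for this toy example). It compares the naive total with the total after `drop_duplicates()`.

```python
import pandas as pd

# Hypothetical listings: the same set appears once per country
listings = pd.DataFrame({
    "set_name": ["Piggy Plane Attack", "Piggy Plane Attack", "MINI Cooper"],
    "country": ["US", "NZ", "US"],
    "piece_count": [168.0, 168.0, 1077.0],
})

# Naive total: the repeated set is counted twice
total_with_repeats = listings["piece_count"].sum()

# Keep one row per set name, then sum
unique_sets = listings.drop_duplicates(subset="set_name", keep="first")
total_unique = unique_sets["piece_count"].sum()

print(total_with_repeats)  # 1413.0
print(total_unique)        # 1245.0
```

Which columns identify a "duplicate" is a judgment call that depends on the data; in a real dataset, two different sets could share a name, so it may be safer to combine several columns.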

In [22]:
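# Sort by country in descending order; keep='first' below then keeps
# each set's listing from the alphabetically last country
df = df.sort_values('country', ascending=False)
num_sets_with_duplicates = df.shape[0]
# Treat rows that share both set_name and prod_desc as the same set
df = df.drop_duplicates(['set_name', 'prod_desc'], keep='first')
num_sets_without_duplicates = df.shape[0]
print(num_sets_with_duplicates)     # rows before de-duplication
print(num_sets_without_duplicates)  # rows after de-duplication
print(num_sets_with_duplicates - num_sets_without_duplicates)  # rows dropped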

12261
758
11503


In [23]:
# Your code here
df['piece_count'].sum()

324694.0

Now, let's try getting the value for the 90% quantile for all numerical columns.  Do this in the cell below.

In [24]:
# Your code here
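# numeric_only=True skips the text columns (it is no longer the default in pandas 2.0+)
df.quantile(.9, numeric_only=True)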

list_price             99.99
num_reviews            37.00
piece_count           925.60
play_star_rating        5.00
prod_id             75821.30
star_rating             5.00
val_star_rating         5.00
Name: 0.9, dtype: float64
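
For a sense of how the quantile is computed, here is a minimal sketch on a small made-up DataFrame (the `toy` data is hypothetical). With five evenly spaced prices, the 90% quantile falls between the fourth and fifth values and is found by linear interpolation, which is the pandas default. The sketch passes `numeric_only=True` explicitly because, starting with pandas 2.0, `DataFrame.quantile()` no longer skips non-numeric columns by default.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "list_price": [10.0, 20.0, 30.0, 40.0, 50.0],
    "theme_name": ["City", "City", "DUPLO", "Juniors", "City"],
})

# 90% quantile of every numeric column; text columns are skipped
q90 = toy.quantile(0.9, numeric_only=True)
print(q90)

# Position 0.9 * (5 - 1) = 3.6, so the result lies 60% of the way
# from the 4th value (40.0) to the 5th value (50.0)
print(q90["list_price"])
```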

## Getting Summary Statistics on Categorical Data

For obvious reasons, most of the methods we've used so far only work with numerical data--there's no way to calculate the standard deviation of a column containing string values. However, there are some things that we can discover about columns containing categorical data. 

In the cell below, get the `.unique()` values contained within the `review_difficulty` column. 

In [25]:
# Your code here
df['review_difficulty'].unique()

array(['Average', nan, 'Easy', 'Challenging', 'Very Easy',
       'Very Challenging'], dtype=object)

Now, let's get the `.value_counts()` for this column, to see how common each value is. 

In [26]:
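# Your code here
# This is not the cell's last expression, so its result is not displayed below
df['review_difficulty'].value_counts()
# Display the rows rated 'Very Challenging'
print(df[df['review_difficulty']=='Very Challenging'])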

    ages  list_price  num_reviews  piece_count  play_star_rating  \
693  14+      199.99          1.0       1966.0               5.0   

                                             prod_desc  prod_id  \
693  Collect the ultimate long-range Rebel starfigh...  75181.0   

                                        prod_long_desc review_difficulty  \
693  Own part of Star Wars history with the Y-Wing ...  Very Challenging   

                set_name  star_rating  theme_name  val_star_rating country  
693  Y-Wing Starfighter™          5.0  Star Wars™              5.0      US  


As you can see, these methods give us quick and easy ways to get information about columns containing categorical data.  
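
Here is a minimal sketch of both methods on a small made-up Series (the `difficulty` values are hypothetical). It also shows two optional arguments to `.value_counts()`: `dropna=False` to count missing values, and `normalize=True` to get proportions instead of counts.

```python
import numpy as np
import pandas as pd

# Hypothetical difficulty labels with one missing value
difficulty = pd.Series(
    ["Easy", "Average", np.nan, "Easy", "Challenging", "Easy"],
    name="review_difficulty",
)

# unique() lists each distinct value once, including the missing value
levels = difficulty.unique()
print(levels)

# value_counts() ignores missing values by default, most common first
counts = difficulty.value_counts()
print(counts)

# dropna=False adds a row for missing values
counts_with_missing = difficulty.value_counts(dropna=False)
print(counts_with_missing)

# normalize=True returns proportions of the non-missing values
shares = difficulty.value_counts(normalize=True)
print(shares)
```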


## Using `.applymap()`

When working with pandas DataFrames, we can apply a function to every element of the DataFrame by calling the `applymap()` method and passing in a function, such as a lambda. 

For instance, we can use `applymap()` to return a version of the DataFrame where every value has been converted to a string.

In the cell below:

* Call our DataFrame's `.applymap()` method, pass in `lambda x: str(x)`, and store the result in `string_df`
* Call our new `string_df` object's `.info()` method to confirm that everything has been cast to a string

In [27]:
string_df = df.applymap(lambda x: str(x))

In [29]:
string_df.info()
string_df.sample(15)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 758 entries, 0 to 3239
Data columns (total 14 columns):
ages                 758 non-null object
list_price           758 non-null object
num_reviews          758 non-null object
piece_count          758 non-null object
play_star_rating     758 non-null object
prod_desc            758 non-null object
prod_id              758 non-null object
prod_long_desc       758 non-null object
review_difficulty    758 non-null object
set_name             758 non-null object
star_rating          758 non-null object
theme_name           758 non-null object
val_star_rating      758 non-null object
country              758 non-null object
dtypes: object(14)
memory usage: 88.8+ KB


|  | ages | list_price | num_reviews | piece_count | play_star_rating | prod_desc | prod_id | prod_long_desc | review_difficulty | set_name | star_rating | theme_name | val_star_rating | country |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10773 | 7-14 | 35.4929 | 4.0 | 119.0 | 5.0 | Bring the Wizarding World to life with Harry a... | 71247.0 | Enhance your LEGO® DIMENSIONS™ experience with... | Easy | Harry Potter™ Team Pack | 5.0 | DIMENSIONS™ | 4.3 | NZ |
| 315 | 2-5 | 19.99 | 1.0 | 19.0 | 3.0 | Enjoy a tea party with Belle and her extraordi... | 10877.0 | Enter the kitchen of the enchanted castle to s... |  | Belle´s Tea Party | 4.0 | DUPLO® | 4.0 | US |
| 327 | 1½-3 | 4.99 |  | 6.0 |  | Create racing role-play stories with the color... | 10860.0 | Little racing drivers will love to build and r... |  | My First Race Car |  | DUPLO® |  | US |
| 427 | 4-7 | 19.99 |  | 115.0 |  | Pick up some tasty organic goodies from Mia an... | 10749.0 | Start the day out right with goodies from Mia’... |  | Mia's Organic Food Market |  | Juniors |  | US |
| 149 | 7-12 | 14.99 | 5.0 | 223.0 | 4.7 | Create a world of scary 3-in-1 Mythical Creatu... | 31073.0 | Enjoy monstrous adventures with the 3-in-1 Myt... | Easy | Mythical Creatures | 4.4 | Creator 3-in-1 | 5.0 | US |
| 179 | 16+ | 99.99 | 134.0 | 1077.0 | 4.4 | Take this MINI Cooper for a nostalgic drive do... | 10242.0 | Take the iconic MINI Cooper for a drive! This ... | Average | MINI Cooper | 4.7 | Creator Expert | 4.5 | US |
| 534 | 5+ | 3.99 | 13.0 | 8.0 | 4.3 | Discover new heroes and villains in LEGO® Mini... | 71020.0 | Bring exciting new play possibilities to exist... | Very Easy | THE LEGO® BATMAN MOVIE Series 2 | 4.8 | Minifigures | 4.0 | US |
| 104 | 5-12 | 49.99 | 5.0 | 200.0 | 3.8 | Keep the crook from escaping his boat ride to ... | 60129.0 | Lock up the crook and transport him to Prison ... | Easy | Police Patrol Boat | 3.2 | City | 2.0 | US |
| 506 | 8+ | 29.99 | 2.0 | 1.0 | 5.0 |  | 45517.0 | This standard 10V DC transformer allows you to... | Very Easy | Transformer 10V DC | 4.0 | MINDSTORMS® | 2.5 | US |
| 486 | 5-12 | 9.99 |  | 86.0 |  | Help Star-Lord grab the stolen mixtape from Ne... | 76090.0 | Join Star-Lord in a LEGO® Marvel Super Heroes ... |  | Mighty Micros: Star-Lord vs. Nebula |  | Marvel Super Heroes |  | US |


Note that everything, even the `NaN` values, has been cast to a string in the example above. 
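
The effect on missing values can be checked directly. The sketch below uses a small made-up DataFrame (the `toy` data is hypothetical) and shows that, after the conversion, a missing value becomes the ordinary string `'nan'` and is no longer detected as missing. Because `applymap()` was deprecated in pandas 2.1 in favor of `DataFrame.map()`, the sketch uses whichever of the two is available.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
toy = pd.DataFrame({
    "num_reviews": [2.0, np.nan],
    "set_name": ["A", "B"],
})

# DataFrame.map replaces applymap from pandas 2.1 onward
elementwise = toy.map if hasattr(pd.DataFrame, "map") else toy.applymap
string_toy = elementwise(lambda x: str(x))

missing_before = toy["num_reviews"].isna().sum()
missing_after = string_toy["num_reviews"].isna().sum()

print(string_toy.loc[1, "num_reviews"])  # nan (now a 3-character string)
print(missing_before, missing_after)     # 1 0
```

This is why every column in the `.info()` output reports the full number of non-null values after the conversion.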

Note that for pandas Series objects (such as a single column in a DataFrame), we can do the same thing using the `apply()` method.  
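
A minimal sketch of `.apply()` on a Series follows, using a few made-up prices (the `prices` values are hypothetical). Any function that takes one value and returns one value can be passed in, whether built-in or a lambda.

```python
import pandas as pd

# Hypothetical prices, invented for illustration only
prices = pd.Series([29.99, 19.99, 12.99], name="list_price")

# Pass a built-in function directly
as_text = prices.apply(str)
print(as_text.tolist())  # ['29.99', '19.99', '12.99']

# Or pass a lambda for custom logic
rounded = prices.apply(lambda x: round(x))
print(rounded.tolist())
```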

This is just one example of how we can quickly apply custom functions to our DataFrame. This will become especially useful when we learn how to **_normalize_** our datasets in a later section!

## Summary

In this lab, we learned how to:

* Understand and use the df.describe() and df.info() summary statistics methods
* Use built-in Pandas methods for calculating summary statistics (.mean(), .std(), .count(), .sum(), .median(), and .quantile())
* Apply a function to every element in a Series or DataFrame using s.apply() and df.applymap()