## Exploratory Data Analysis in Python



## Course Description

How do we get from data to answers? Exploratory data analysis is a process for exploring datasets, answering questions, and visualizing results. This course presents the tools you need to clean and validate data, to visualize distributions and relationships between variables, and to use regression models to predict and explain. You'll explore data related to demographics and health, including the National Survey of Family Growth and the General Social Survey. But the methods you learn apply to all areas of science, engineering, and business. You'll use Pandas, a powerful library for working with data, and other core Python libraries including NumPy and SciPy, StatsModels for regression, and Matplotlib for visualization. With these tools and skills, you will be prepared to work with real data, make discoveries, and present compelling results.

##  Read, clean, and validate

The first step of almost any data project is to read the data, check for errors and special cases, and prepare data for analysis. This is exactly what you'll do in this chapter, while working with a dataset obtained from the National Survey of Family Growth.

    DataFrames and Series    50 xp
    Read the codebook    50 xp
    Exploring the NSFG data    100 xp
    Clean and Validate    50 xp
    Validate a variable    50 xp
    Clean a variable    100 xp
    Compute a variable    100 xp
    Filter and visualize    50 xp
    Make a histogram    100 xp
    Compute birth weight    100 xp
    Filter    100 xp


##  Distributions

In the first chapter, having cleaned and validated your data, you began exploring it by using histograms to visualize distributions. In this chapter, you'll learn how to represent distributions using Probability Mass Functions (PMFs) and Cumulative Distribution Functions (CDFs). You'll learn when to use each of them, and why, while working with a new dataset obtained from the General Social Survey.

    Probability mass functions    50 xp
    Make a PMF    100 xp
    Plot a PMF    100 xp
    Cumulative distribution functions    50 xp
    Make a CDF    100 xp
    Compute IQR    100 xp
    Plot a CDF    100 xp
    Comparing distributions    50 xp
    Distribution of education    50 xp
    Extract education levels    100 xp
    Plot income CDFs    100 xp
    Modeling distributions    50 xp
    Distribution of income    100 xp
    Comparing CDFs    100 xp
    Comparing PDFs    100 xp


##  Relationships

Up until this point, you've only looked at one variable at a time. In this chapter, you'll explore relationships between variables two at a time, using scatter plots and other visualizations to extract insights from a new dataset obtained from the Behavioral Risk Factor Surveillance Survey (BRFSS). You'll also learn how to quantify those relationships using correlation and simple regression.

    Exploring relationships    50 xp
    PMF of age    100 xp
    Scatter plot    100 xp
    Jittering    100 xp
    Visualizing relationships    50 xp
    Height and weight    100 xp
    Distribution of income    100 xp
    Income and height    100 xp
    Correlation    50 xp
    Computing correlations    100 xp
    Interpreting correlations    50 xp
    Simple regression    50 xp
    Income and vegetables    100 xp
    Fit a line    100 xp


##  Multivariate Thinking

Explore multivariate relationships using multiple regression to describe non-linear relationships and logistic regression to explain and predict binary variables.

    Limits of simple regression    50 xp
    Regression and causation    50 xp
    Using StatsModels    100 xp
    Multiple regression    50 xp
    Plot income and education    100 xp
    Non-linear model of education    100 xp
    Visualizing regression results    50 xp
    Making predictions    100 xp
    Visualizing predictions    100 xp
    Logistic regression    50 xp
    Predicting a binary variable    100 xp
    Next steps    50 xp 
    

## DataFrames and Series



Welcome to Exploratory Data Analysis in Python.  The instructor's name is Allen Downey.  [The goal of exploratory data analysis is to answer questions and guide decision making].  As a first example, we'll start with a simple __question: what is the average birth weight of babies in the United States?__  To answer a question like this, we have to find an appropriate dataset or run an experiment to collect one.  Then we have to get the data into our development environment and prepare it for analysis, which involves cleaning and validation.  For this question we'll use data from the National Survey of Family Growth (NSFG), which is available from the National Center for Health Statistics.  The 2013-2015 dataset includes information about a representative sample of women in the USA and their children.  

The Python module we'll use to read and analyze data is Pandas.  Pandas can read data in most common formats, including CSV, Excel, and the format the NSFG data is in, HDF5 (review: how did we import data from different sources with Pandas?).  The result from "pd.read_hdf('nsfg.hdf5', 'nsfg')" is a DataFrame, which is the primary data structure Pandas uses to store data.  "df.head()" shows the first 5 rows.  The DataFrame has one row for each pregnancy of each of the women who participated in the survey, and one column for each variable.  The DF has an attribute ".shape", which is the number of rows and columns, and an attribute ".columns", which is an [Index] (a DF is formed from 3 objects: index, columns, and data; go back and reread that topic).  An Index is another Pandas data structure, similar to a list; in this case it's a list of variable names, which are strings.  For reliable information about the data, you have to read the documentation.  Say, what does the column "birthwgt_lb1" mean?  The documentation tells us that it is the weight in pounds of the first baby from this pregnancy, for cases of live birth.  

In many ways, a DF is like a Python dictionary, where the variable names are the keys and the columns are the values.  You can select a column from a DF using the bracket operator, with a string as the key.  The result is a Series, which is another Pandas data structure.  In this case the Series contains the birth weights, in pounds, of the live births (or in the case of multiple births, the first baby).  Calling "head()" on the Series shows its first 5 rows, its name, and its datatype; float64 means that these values are 64-bit floating point numbers.  Notice that one of the values is NaN, which stands for "Not a Number".  NaN is a special value that can indicate invalid or missing data.  In this example, the pregnancy did not end in live birth, so birth weight is inapplicable.  
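As a minimal sketch of column selection, the snippet below rebuilds two NSFG columns by hand from the first five rows of the "df.head()" output in these notes, then selects one column with the bracket operator.  The name "sample" is made up for illustration; the real data would come from "pd.read_hdf".

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the first five NSFG rows (hypothetical name: sample)
sample = pd.DataFrame({
    'caseid': [60418, 60418, 60418, 60419, 60420],
    'birthwgt_lb1': [5.0, 4.0, 5.0, np.nan, 8.0],
})

# The bracket operator with a column name as the key returns a Series
pounds = sample['birthwgt_lb1']

print(type(pounds))
print(pounds.head())         # first five values, plus the name and dtype
print(pounds.isna().sum())   # how many values are missing (NaN)
```

The NaN in the fourth row belongs to the pregnancy that did not end in a live birth.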



In [12]:
import pandas as pd


df = pd.read_hdf('nsfg.hdf5', 'nsfg')
print(type(df), '\n')


print(df.head(), '\n')

print(df.shape, '\n')
print(df.columns, '\n')


print("Case Id: \n", df['caseid'])

<class 'pandas.core.frame.DataFrame'> 

   caseid  outcome  birthwgt_lb1  birthwgt_oz1  prglngth  nbrnaliv  agecon  \
0   60418        1           5.0           4.0        40       1.0    2000   
1   60418        1           4.0          12.0        36       1.0    2291   
2   60418        1           5.0           4.0        36       1.0    3241   
3   60419        6           NaN           NaN        33       NaN    3650   
4   60420        1           8.0          13.0        41       1.0    2191   

   agepreg  hpagelb  wgt2013_2015  
0   2075.0     22.0   3554.964843  
1   2358.0     25.0   3554.964843  
2   3308.0     52.0   3554.964843  
3      NaN      NaN   2484.535358  
4   2266.0     24.0   2903.782914   

(9358, 10) 

Index(['caseid', 'outcome', 'birthwgt_lb1', 'birthwgt_oz1', 'prglngth',
       'nbrnaliv', 'agecon', 'agepreg', 'hpagelb', 'wgt2013_2015'],
      dtype='object') 

Case Id: 
 0       60418
1       60418
2       60418
3       60419
4       60420
        ...  
9

## Read the codebook

When you work with datasets like the NSFG, it is important to read the documentation carefully. If you interpret a variable incorrectly, you can generate nonsense results and never realize it. So before you start coding, you'll need to get familiar with the NSFG codebook, which describes every variable.

Here is the documentation from the NSFG codebook for "BIRTHWGT_OZ1":

birthwgt_oz1 codebook

How many respondents refused to answer this question?

Possible Answers

    1
    35
    48-49
    2967


<img src='https://assets.datacamp.com/production/repositories/4025/datasets/0d2a0c18b63f3ddf056858c145a6bdc022d8656c/Screenshot%202019-03-31%2019.16.14.png'>

## Exploring the NSFG data

To get the number of rows and columns in a DataFrame, you can read its shape attribute.

To get the column names, you can read the columns attribute. The result is an Index, which is a Pandas data structure that is similar to a list. Let's begin exploring the NSFG data! It has been pre-loaded for you into a DataFrame called nsfg.
Instructions

    Question 1
    Calculate the number of rows and columns in the DataFrame nsfg.
    
    
    Question 2
    Display the names of the columns in nsfg.
    
    
    Question 3
    Select the column 'birthwgt_oz1' and assign it to a new variable called ounces.
    
    
    Question 4
    Display the first 5 elements of ounces.


In [None]:
# Display the number of rows and columns
nsfg.shape

# Display the names of the columns
nsfg.columns

# Select column birthwgt_oz1: ounces
ounces = nsfg['birthwgt_oz1']

# Print the first 5 elements of ounces
print(ounces.head())

## Clean and Validate



In the previous lesson, we read data from the National Survey of Family Growth and selected a column from a DataFrame.  [In this lesson, we'll check for errors and prepare the data for analysis].  We'll use the same DF we used in the previous lesson, nsfg, which contains one row for each pregnancy in the survey.  We'll select the variable "birthwgt_lb1", which contains the pound part of birth weight, and assign it to pounds.  And "birthwgt_oz1" contains the ounce part of birth weight, so we'll assign that to ounces.  

[Before we do anything with this data, we have to validate it].  One part of validation is confirming that we are interpreting the data correctly.  We can use the "value_counts()" method to see what values appear in pounds and how many times each value appears.  By default, the results are sorted with the most frequent value first, so we use the ".sort_index()" method to sort them by value instead, with the lightest babies first and the heaviest babies last (note: "sort_index()" is the right call here, because "value_counts()" puts the distinct values in the index and the frequencies in the data; "sort_values()" would sort by frequency).  As we'd expect, the most frequent values are 6-8 pounds, but there are some very light babies, a few very heavy babies, and two values, 98 and 99, that indicate missing data.  We can validate the results by comparing them to the codebook, which lists the values and their frequencies.  The results here agree with the codebook (__that is, the values and their frequencies match the documentation; review the data cleaning material for more on this__), so we have some confidence that we are reading and interpreting the data correctly.  
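To see why "sort_index()" sorts by value, here is a small sketch on made-up data (the numbers are invented for illustration, not taken from the NSFG).  It compares the three orderings of a "value_counts()" result.

```python
import pandas as pd

# Made-up birth weights in pounds; 98 and 99 stand in for missing-data codes
pounds = pd.Series([7, 6, 7, 8, 7, 6, 99, 5, 98, 7])

counts = pounds.value_counts()    # index = distinct values, data = frequencies
by_value = counts.sort_index()    # ordered by the value itself (lightest first)
by_count = counts.sort_values()   # ordered by frequency (rarest first)

print(counts)     # default order: most frequent value first
print(by_value)
print(by_count)
```

Because the distinct values live in the index, sorting the index puts the lightest babies first and leaves the codes 98 and 99 at the end, where they are easy to spot.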

Another way to validate the data is with the "describe()" method, which computes summary statistics like the mean, standard deviation, min, and max (__but be careful with data type and range: numeric codes that stand for categories are included in the min, max, and mean as if they were real measurements__).  Here we have the results for pounds.  The "count" is the number of values.  The "minimum" and "maximum" values are 0 and 99 (here 98 and 99 are not real weights, but special codes for missing or invalid data), and the 50th percentile, which is the [median], is 7.  (__Why is the median sometimes used instead of the average?__  The mean is the most frequently used measure of central tendency because it uses all values in the data set.  For data from skewed distributions, the median is often a better summary than the mean because it isn't influenced by extremely large values.)  
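The effect of the special codes on the summary statistics can be sketched with a few made-up values (invented for illustration, not NSFG data).  Two codes among ten values are enough to drag the mean far from the median.

```python
import pandas as pd

# Made-up birth weights in pounds; 98 and 99 stand in for missing-data codes
pounds = pd.Series([7, 6, 7, 8, 7, 6, 99, 5, 98, 7])

summary = pounds.describe()   # count, mean, std, min, quartiles, max
print(summary)

# The codes inflate the mean, while the median (50th percentile) barely moves
print(pounds.mean())
print(pounds.median())
```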

The mean is about 8.05, but that doesn't mean much because it includes the special values 98 and 99.  Before we can really compute the mean, we have to replace those values with NaN to represent missing data.  The "replace()" method does what we want; it takes a list of values we want to replace and the value we want to replace them with.  "np.nan" means we are getting the special value NaN from the NumPy library.  __The result from "replace()" is a new Series, which we assign back to the original variable__.  Remember that the mean of the original Series was about 8.05 pounds.  The mean of the new Series is about 6.7 pounds.  It makes a big difference when you remove a few 98 and 99 pound values.  
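A minimal sketch of this step, again on made-up values rather than the real NSFG column, shows that "replace()" returns a new Series and that the mean changes once the codes are gone.

```python
import numpy as np
import pandas as pd

# Made-up birth weights in pounds; 98 and 99 stand in for missing-data codes
pounds = pd.Series([7.0, 6.0, 7.0, 8.0, 7.0, 6.0, 99.0, 5.0, 98.0, 7.0])

# replace() returns a new Series; the original is left untouched
cleaned = pounds.replace([98, 99], np.nan)

print(pounds.mean())    # includes the codes
print(cleaned.mean())   # NaN values are skipped by default
```

In the lesson the result is assigned back to the same variable, so the cleaned Series takes the place of the original.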

[Instead of making a new Series, you can call "replace()" with "inplace=True", which modifies the existing Series in place, that is, without making a copy].  Here's what that looks like for ounces.  Since we didn't make a new Series, we don't have to assign the result back.  
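The in-place variant can be sketched the same way on a standalone, made-up Series (the values are invented for illustration).

```python
import numpy as np
import pandas as pd

# Made-up ounce values; 98 and 99 stand in for missing-data codes
ounces = pd.Series([4.0, 12.0, 4.0, 98.0, 13.0, 99.0])

# inplace=True modifies this Series directly, so nothing is assigned back
ounces.replace([98, 99], np.nan, inplace=True)

print(ounces)
```

One caveat, stated as an assumption about recent pandas versions: when the Series is a column pulled out of a DataFrame, an in-place change may not reach the DataFrame, so assigning the result back to the column is the safer pattern.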


Now say we want to combine pounds and ounces into a single Series that contains total birth weight.  Arithmetic operators work with Series objects; so to convert from ounces to pounds, we can divide by 16 (there are 16 ounces in a pound).  Then we can add the two Series objects to get the total (compare how this calculation is done in SQL and in Pandas, and review the course where we learned it).  And here are the results.  The mean is now about 7.1 pounds, which is a little more than what we got before we added in the ounces part.  [Now we are close to answering our original question, the average birth weight for babies in the USA] (__we still need to check the distribution, because the median is more representative than the mean for a skewed distribution__).  But as we'll see in the next lesson, we're not there yet.  
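The arithmetic can be checked on the five rows shown in the "df.head()" output of these notes.  The sketch below copies those pound and ounce values by hand into two Series.

```python
import numpy as np
import pandas as pd

# Pound and ounce parts copied from the first five NSFG rows shown in these notes
pounds = pd.Series([5.0, 4.0, 5.0, np.nan, 8.0])
ounces = pd.Series([4.0, 12.0, 4.0, np.nan, 13.0])

# Arithmetic is element-wise: convert ounces to pounds, then add the two parts
birth_weight = pounds + ounces / 16

print(birth_weight)
```

Any row where either part is NaN stays NaN in the total, which is what we want for pregnancies without a live birth.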




## The values are based on how hard you are thinking and how hard you are pursuing

In [32]:
print(df.head())
print('\n')


print(df['prglngth'].value_counts()[:5])
print('\n')


print(df['prglngth'].value_counts().sort_values()[:5])
print('\n')


print(df['prglngth'].value_counts().sort_index()[:5])    # sort_index() orders by value: value_counts() puts the values in the index

print(df['birthwgt_lb1'].describe())


import numpy as np
# 98 and 99 are codes for missing data; replace them with NaN in both parts
df['birthwgt_lb1'] = df['birthwgt_lb1'].replace([98, 99], np.nan)
df['birthwgt_oz1'] = df['birthwgt_oz1'].replace([98, 99], np.nan)

print(df['birthwgt_lb1'].describe())
print('\n')



df['Total_weight_pounds'] = df['birthwgt_lb1'] + df['birthwgt_oz1']/16   # ==========================================

print(df['Total_weight_pounds'].describe())

   caseid  outcome  birthwgt_lb1  birthwgt_oz1  prglngth  nbrnaliv  agecon  \
0   60418        1           5.0           4.0        40       1.0    2000   
1   60418        1           4.0          12.0        36       1.0    2291   
2   60418        1           5.0           4.0        36       1.0    3241   
3   60419        6           NaN           NaN        33       NaN    3650   
4   60420        1           8.0          13.0        41       1.0    2191   

   agepreg  hpagelb  wgt2013_2015  Total_weight_pounds  
0   2075.0     22.0   3554.964843               5.2500  
1   2358.0     25.0   3554.964843               4.7500  
2   3308.0     52.0   3554.964843               5.2500  
3      NaN      NaN   2484.535358                  NaN  
4   2266.0     24.0   2903.782914               8.8125  


39    2384
40    1311
38     755
37     432
41     422
Name: prglngth, dtype: int64


48    1
45    3
46    3
0     7
23    7
Name: prglngth, dtype: int64


0      7
1     11
2     50
3  

## Validate a variable

In the NSFG dataset, the variable 'outcome' encodes the outcome of each pregnancy as shown below:

| value | label |
|-------|-------|
| 1 | Live birth |
| 2 | Induced abortion |
| 3 | Stillbirth |
| 4 | Miscarriage |
| 5 | Ectopic pregnancy |
| 6 | Current pregnancy |

The nsfg DataFrame has been pre-loaded for you. Explore it in the IPython Shell and use the methods Allen showed you in the video to answer the following question: How many pregnancies in this dataset ended with a live birth?
Possible Answers

    6489
    9538
    1469
    6
    

In [None]:
In [1]:
nsfg['outcome'].count_values()
Traceback (most recent call last):
  File "<stdin>", line 72, in exceptionCatcher
    raise exception
  File "<stdin>", line 3361, in run_ast_nodes
    if (await self.run_code(code, result,  async_=asy)):
  File "<stdin>", line 3458, in run_code
    self.showtraceback(running_compiled_code=True)
  File "<stdin>", line 2066, in showtraceback
    self._showtraceback(etype, value, stb)
  File "<stdin>", line 72, in exceptionCatcher
    raise exception
  File "<stdin>", line 3441, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<stdin>", line 1, in <module>
    nsfg['outcome'].count_values()
  File "<stdin>", line 5487, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'count_values'
In [2]:
nsfg['outcome'].value_counts()   # ==================================================================================
Out[2]:                          # It's value_counts(), NOT count_values()

1    6489
4    1469
2     947
6     249
5     118
3      86
Name: outcome, dtype: int64

## Clean a variable

In the NSFG dataset, the variable 'nbrnaliv' records the number of babies born alive at the end of a pregnancy.

If you use .value_counts() to view the responses, you'll see that the value 8 appears once, and if you consult the codebook, you'll see that this value indicates that the respondent refused to answer the question.

Your job in this exercise is to replace this value with np.nan. Recall from the video how Allen replaced the values 98 and 99 in the ounces column using the .replace() method:

ounces.replace([98, 99], np.nan, inplace=True)

Instructions

    In the 'nbrnaliv' column, replace the value 8, in place, with the special value NaN.
    Confirm that the value 8 no longer appears in this column by printing the values and their frequencies.
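One possible way to approach this exercise is sketched below on a small made-up stand-in for the 'nbrnaliv' column (the values are invented; in the exercise, nsfg is pre-loaded).  As an assumption on my part, the sketch assigns the result back to the column instead of passing "inplace=True", since that pattern does not depend on how pandas handles in-place changes to a selected column.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the pre-loaded nsfg DataFrame; 8 is the "refused" code
nsfg = pd.DataFrame({'nbrnaliv': [1.0, 1.0, 2.0, np.nan, 8.0, 1.0, 3.0]})

# Replace the code 8 with NaN and store the cleaned column back in the DataFrame
nsfg['nbrnaliv'] = nsfg['nbrnaliv'].replace(8, np.nan)

# Confirm that 8 no longer appears among the values and their frequencies
print(nsfg['nbrnaliv'].value_counts())
```

By default "value_counts()" leaves NaN out of the result, so the replaced value simply disappears from the frequency table.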
