# Exploratory Data Analysis in Python

How do we get from data to answers? Exploratory data analysis is a process for exploring datasets, answering questions, and visualizing results. This course presents the tools you need to clean and validate data, to visualize distributions and relationships between variables, and to use regression models to predict and explain. You'll explore data related to demographics and health, including the National Survey of Family Growth and the General Social Survey. But the methods you learn apply to all areas of science, engineering, and business. You'll use Pandas, a powerful library for working with data, and other core Python libraries including NumPy and SciPy, StatsModels for regression, and Matplotlib for visualization. With these tools and skills, you will be prepared to work with real data, make discoveries, and present compelling results.

## 1. Read, clean, and validate

The first step of almost any data project is to read the data, check for errors and special cases, and prepare data for analysis. This is exactly what you'll do in this chapter, while working with a dataset obtained from the National Survey of Family Growth.

    1.1 DataFrames and Series
    1.2 Read the codebook
    1.3 Exploring the NSFG data
    1.4 Clean and validate
    1.5 Validate a variable
    1.6 Clean a variable
    1.7 Compute a variable
    1.8 Filter and visualize
    1.9 Make a histogram
    1.10 Compute birth weight
    1.11 Filter
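
The validate, clean, compute, and filter steps listed above can be sketched in pandas roughly as follows. This is a minimal illustration, not the course's own code: the tiny DataFrame, the column names `birthwgt_lb` and `birthwgt_oz`, and the sentinel codes 98 and 99 are all assumptions standing in for whatever the real codebook specifies.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few survey records. The column names and the
# sentinel codes 98/99 ("refused" / "don't know") are assumptions made for
# illustration; check the real codebook before relying on them.
df = pd.DataFrame({
    "birthwgt_lb": [7, 6, 8, 99, 5, 98, 7],
    "birthwgt_oz": [8, 0, 12, 99, 4, 98, 15],
})

# Validate: count how often each value appears, including special codes.
print(df["birthwgt_lb"].value_counts().sort_index())

# Clean: replace the sentinel codes with NaN so they do not distort statistics.
pounds = df["birthwgt_lb"].replace([98, 99], np.nan)
ounces = df["birthwgt_oz"].replace([98, 99], np.nan)

# Compute: combine pounds and ounces into a single weight in pounds.
birth_weight = pounds + ounces / 16.0

# Filter: keep only the rows with a valid computed weight.
valid = birth_weight.dropna()
print(valid.describe())
```

Replacing special codes with NaN before computing anything is the key design choice here: pandas skips NaN in summaries such as `mean()` and `describe()`, whereas a stray 99 would silently inflate them.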

## 2. Distributions

In the first chapter, having cleaned and validated your data, you began exploring it by using histograms to visualize distributions. In this chapter, you'll learn how to represent distributions using Probability Mass Functions (PMFs) and Cumulative Distribution Functions (CDFs). You'll learn when to use each of them, and why, while working with a new dataset obtained from the General Social Survey.

    2.1 Probability mass functions
    2.2 Make a PMF
    2.3 Plot a PMF
    2.4 Cumulative distribution functions
    2.5 Make a CDF
    2.6 Compute IQR
    2.7 Plot a CDF
    2.8 Comparing distributions
    2.9 Distribution of education
    2.10 Extract education levels
    2.11 Plot income CDFs
    2.12 Modeling distributions
    2.13 Distribution of income
    2.14 Comparing CDFs
    2.15 Comparing PDFs
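
As a rough sketch of the ideas in this chapter, a PMF and a CDF can be built with plain pandas. The sample values below are made up for illustration and are not taken from the General Social Survey; the course itself may use a dedicated helper library rather than this approach.

```python
import pandas as pd

# Hypothetical sample: years of education for ten respondents.
educ = pd.Series([12, 12, 14, 16, 12, 16, 18, 14, 12, 16])

# PMF: the fraction of respondents at each distinct value.
pmf = educ.value_counts(normalize=True).sort_index()

# CDF: the running total of the PMF, i.e. P(educ <= x) for each value x.
cdf = pmf.cumsum()

# Probability of having 14 or fewer years of education.
p_le_14 = cdf.loc[14]

# Interquartile range: distance between the 75th and 25th percentiles.
q25, q75 = educ.quantile([0.25, 0.75])
iqr = q75 - q25

print(pmf)
print(cdf)
print("P(educ <= 14):", p_le_14)
print("IQR:", iqr)
```

A PMF works well when a variable takes a small number of distinct values; a CDF tends to be the better choice for comparing groups or for variables with many distinct values, because it is less sensitive to noise.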

## 3. Relationships

Up until this point, you've only looked at one variable at a time. In this chapter, you'll explore relationships between variables two at a time, using scatter plots and other visualizations to extract insights from a new dataset obtained from the Behavioral Risk Factor Surveillance System (BRFSS). You'll also learn how to quantify those relationships using correlation and simple regression.

    3.1 Exploring relationships
    3.2 PMF of age
    3.3 Scatter plot
    3.4 Jittering
    3.5 Visualizing relationships
    3.6 Height and weight
    3.7 Distribution of income
    3.8 Income and height
    3.9 Correlation
    3.10 Computing correlations
    3.11 Interpreting correlations
    3.12 Simple regression
    3.13 Income and vegetables
    3.14 Fit a line
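
Jittering, correlation, and simple regression can be illustrated along these lines. The data here is synthetic, generated with an assumed linear relationship between height and weight, so the numbers say nothing about the BRFSS; the column names are likewise hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): heights in cm, rounded the
# way self-reported survey values often are, and weights in kg with noise.
n = 500
height = rng.normal(170, 10, size=n).round()
weight = 0.9 * (height - 170) + 80 + rng.normal(0, 12, size=n)
df = pd.DataFrame({"height": height, "weight": weight})

# Jittering: add small random noise so rounded values do not stack into
# columns when drawn in a scatter plot.
height_jitter = df["height"] + rng.normal(0, 0.5, size=n)

# Correlation: Pearson's coefficient between the two variables.
corr = df["height"].corr(df["weight"])

# Simple regression: least-squares line of weight as a function of height.
res = linregress(df["height"], df["weight"])

# Endpoints of the fitted line, ready to overlay on a scatter plot.
fx = np.array([df["height"].min(), df["height"].max()])
fy = res.intercept + res.slope * fx

print("correlation:", corr)
print("slope:", res.slope, "intercept:", res.intercept)
```

Note that correlation only measures the strength of a linear relationship, while the regression slope estimates its size in the units of the data; the two answer different questions.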

## 4. Multivariate thinking

In this chapter, you'll explore multivariate relationships, using multiple regression to describe non-linear relationships and logistic regression to explain and predict binary variables.

    4.1 Limits of simple regression
    4.2 Regression and causation
    4.3 Using StatsModels
    4.4 Multiple regression
    4.5 Plot income and education
    4.6 Non-linear model of education
    4.7 Visualizing regression results
    4.8 Making predictions
    4.9 Visualizing predictions
    4.10 Logistic regression
    4.11 Predicting a binary variable
    4.12 Next steps
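
The following sketch shows one way multiple regression with a quadratic term and logistic regression might look using the StatsModels formula interface. The data is synthetic and the variable names (`educ`, `age`, `income`, `high_income`) are hypothetical, chosen only to echo the topics above; the coefficients used to generate the data are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 1000

# Synthetic data (an assumption for illustration): income rises with
# education and follows an inverted-U shape in age by construction.
educ = rng.integers(8, 21, size=n)
age = rng.integers(18, 80, size=n)
income = 1000 * educ + 800 * age - 8 * age**2 + rng.normal(0, 5000, size=n)
df = pd.DataFrame({"educ": educ, "age": age, "income": income})

# Adding a squared term lets a linear model describe a non-linear relationship.
df["age2"] = df["age"] ** 2

# Multiple regression: income as a function of education, age, and age squared.
results = smf.ols("income ~ educ + age + age2", data=df).fit()
print(results.params)

# Predictions: vary age over a grid while holding education fixed at 12 years.
grid = pd.DataFrame({"age": np.linspace(18, 79, 50)})
grid["age2"] = grid["age"] ** 2
grid["educ"] = 12
pred = results.predict(grid)

# Logistic regression: a hypothetical binary outcome whose probability
# increases with education.
p = 1 / (1 + np.exp(-(-4 + 0.3 * educ)))
df["high_income"] = (rng.random(n) < p).astype(int)
logit_results = smf.logit("high_income ~ educ", data=df).fit(disp=False)

# Predicted probabilities for 10 and 18 years of education.
prob = logit_results.predict(pd.DataFrame({"educ": [10, 18]}))
print(prob)
```

Holding the other variables constant while varying one of them, as the prediction grid does, is a common way to visualize what a multiple regression model says about a single relationship.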

# Additional material

- DataCamp course: https://learn.datacamp.com/courses/exploratory-data-analysis-in-python