# Exploratory Data Analysis
> First steps in exploration

- toc: true
- badges: true
- comments: true
- author: Johann Augustine
- image: images/NYC-harbor.jpg
- categories: [lambda-school, data-science]

## Overview

> Exploratory Data Analysis (EDA) is the critical process of conducting preliminary investigations on data in order to discover patterns, detect anomalies, test hypotheses, and validate assumptions using summary statistics and graphical representations.

When we first start with a new dataset, we often perform exploratory data analysis. The discoveries that we make during this stage of the process drive how we treat our data, the models we choose, the approach we take to analyzing our data, and, in large part, the entirety of our data science methodology and next steps.

## Load a CSV dataset using pandas read_csv

We will be working with data in many different forms throughout our careers. But before we can start to do anything with that data, we need to load it into our workspace. 

> Tip: Start with [Google Colab](https://colab.research.google.com) and ease your way into a working local environment on your machine.

### Pandas
You are likely already familiar with the Python data analysis library pandas. We'll provide a quick overview here and then work through some examples in the next section.

The pandas library includes a wide range of data analysis and manipulation tools. It also offers data structures (Series and DataFrames) that are designed to work well with a variety of data, including tabular data (for example, columns with different data types), time series data, and matrix data.

To begin, we'll look at one of the most commonly used pandas functions: `read_csv()`. It reads data in the comma-separated values (CSV) format, where the values in each row are separated by commas and each row starts on a new line. A CSV file can be loaded from a URL or read from a file stored locally on your device.

Like many other pandas functions, `read_csv()` has many options. To learn more about them, start with the official documentation at https://pandas.pydata.org/docs/.
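As a small illustration of those options, here is a minimal sketch that reads a CSV with no header row. The in-memory text and the column names (`square_1`, `square_2`, `square_3`, `outcome`) are made up for this example; `header` and `names` are real `read_csv()` parameters.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV with no header row (stands in for a real file or URL)
raw_csv = io.StringIO("x,o,b,positive\no,x,b,negative\n")

# header=None tells pandas the first line is data, not column names;
# names= supplies our own (made-up) column labels
df_example = pd.read_csv(
    raw_csv,
    header=None,
    names=["square_1", "square_2", "square_3", "outcome"],
)

print(df_example.shape)
print(df_example.columns.tolist())
```

Without `header=None`, pandas would treat the first record as the column names and the DataFrame would have one row fewer.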

#### Load a CSV dataset from a URL using pandas read_csv

In [1]:
# Import pandas with the standard alias
import pandas as pd

# Set a variable to the URL of the data file
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/tic-tac-toe/tic-tac-toe.data'

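# Read or load the data. This file has no header row, so header=None
# keeps the first record as data instead of using it as column names.
df = pd.read_csv(url, header=None)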

#### Load a CSV dataset from a local file using pandas read_csv

When working with data sets, you'll find many of them conveniently stored online at various locations (UCI Repository, Kaggle, etc.). But you'll often want to download a data set and store it on your local computer. This makes it easy to view and edit the file with the tools on your computer, and it gives you a copy safely stored on your hard drive.

The pandas `read_csv()` function can also read locally saved files. Instead of providing the URL, you provide the path to the file. Think of the path as the address of the file: it's a list of directories leading to the file location. An example path might look like `/Users/myname/Documents/data_science/tic-tac-toe.data`, where the last part of the path is the file name. To read in a local file with pandas, you can use the following code:

In [None]:
# Read in an example file (from an example user's Downloads folder).
# This file has no header row, so header=None keeps the first record as data.
import pandas as pd
df = pd.read_csv('/home/username/Downloads/tic-tac-toe.data', header=None)
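To try reading a local file without downloading anything, here is a minimal sketch that writes a tiny CSV to a temporary directory and reads it back. The file name `example.csv` and its contents are made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Create a throwaway directory and a small CSV file inside it
# (the file name and contents are made up for illustration)
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "example.csv"
    csv_path.write_text("color,mass\nblue,2.18\nbrown,2.01\n")

    # read_csv accepts a path string or a pathlib.Path
    df_local = pd.read_csv(csv_path)

print(df_local)
```

The data is fully loaded into memory by `read_csv()`, so `df_local` is still usable after the temporary directory is cleaned up.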

## Use basic pandas functions for Exploratory Data Analysis (EDA)

Exploratory data analysis (EDA) is a very important part of learning to be a data scientist, and it is something that experienced data scientists do on a regular basis. We'll be using some of the numerous tools available in the pandas library. Earlier in this post, we learned how to load data sets into notebooks. So now that we have all this data, what do we do with it?

### Basic Information

There are a few methods to look at your DataFrame quickly and get an idea of what's inside. Here are a few of the most common with descriptions of what each method does:

| method        | description                                 |
|:---------------|:--------------------------------------------|
|`df.shape`     | display the size as (rows, columns) |
|`df.head()`    | display the first n rows (default=5) |
|`df.tail()`    | display the last n rows (default=5) |
|`df.describe()`| display the statistics of numerical data types  |
|`df.info()`    | display the number of entries (rows), number of columns, and the data types |
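A minimal sketch of these calls on a tiny DataFrame. The DataFrame `df_small` and its values are made up for illustration and are not taken from any data set used in this post.

```python
import pandas as pd

# Small made-up DataFrame (values are illustrative only)
df_small = pd.DataFrame({
    "color": ["blue", "brown", "orange", "brown", "yellow", "red"],
    "mass": [2.18, 2.01, 1.78, 1.98, 1.62, 0.91],
})

print(df_small.shape)       # (rows, columns)
print(df_small.head(2))     # first 2 rows
print(df_small.tail(2))     # last 2 rows
print(df_small.describe())  # summary statistics for the numeric column only
df_small.info()             # prints entries, columns, dtypes, and memory usage
```

Note that `shape` is an attribute, so it takes no parentheses, while the others are methods.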

### Column-specific

Sometimes we don't want to look at the entire DataFrame and instead want to focus on a single column or a few columns. There are a few ways to select a column but we'll mainly use the column name. If we have a DataFrame called `df` and a column named "column_1" we could select just a single column by using `df["column_1"]`. Once we have a single column selected, we can use some of the following methods to get more information.

| method            | description                                 |
|:-------------------|:--------------------------------------------|
|`df.columns`     | print a list of the columns |
|`df['column_name']`| select a single column ( returns a Series) |
|`df['column_name'].value_counts()`| count how many times each unique value occurs in the column |
|`df.sort_values(by='column_name')` | sort the rows by the values in the given column |
|`df.drop()` | remove rows or columns by specifying the label or index of the row/column |
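Here is a minimal sketch of the column-level operations in the table, again on a made-up DataFrame (`df_cols` and its values are hypothetical).

```python
import pandas as pd

# Small made-up DataFrame (values are illustrative only)
df_cols = pd.DataFrame({
    "color": ["blue", "brown", "orange", "brown"],
    "mass": [2.18, 2.01, 1.78, 1.98],
})

print(df_cols.columns.tolist())                  # list of column labels

color_counts = df_cols["color"].value_counts()   # occurrences of each unique value
sorted_df = df_cols.sort_values(by="mass")       # rows ordered by mass, ascending
without_mass = df_cols.drop(columns="mass")      # new DataFrame without 'mass'

print(color_counts)
print(sorted_df)
print(without_mass)
```

Both `sort_values()` and `drop()` return new DataFrames here, so `df_cols` itself still has its original row order and both columns.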

### Missing Values

There is a lot of data out there, and along with that comes the unavoidable fact that some of it will be messy. This means that there will be missing values, "not-a-number" (NaN) occurrences, and zeros that are not actually zero. Fortunately, there are a number of pandas methods that make dealing with the mess a little easier.

| method            | description                                       |
|:-------------------|:--------------------------------------------------|
|`df.isnull().sum()` | count the null values (NaN or None) in each column |
|`df.fillna()` | fill NaN values in a variety of ways |
|`df.dropna()` | remove rows or columns that contain NaN or None; by default removes every row with at least one NaN |
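A minimal sketch of the three methods on a made-up DataFrame with one missing value per column. The fill choices (the string `"unknown"` and the column mean) are just one option picked for illustration, not a recommendation.

```python
import pandas as pd

# Made-up DataFrame with one missing value in each column
df_missing = pd.DataFrame({
    "color": ["blue", None, "orange"],
    "mass": [2.18, 2.01, float("nan")],
})

# Number of nulls per column
null_counts = df_missing.isnull().sum()

# Fill each column differently: a placeholder string and the column mean
filled = df_missing.fillna({
    "color": "unknown",
    "mass": df_missing["mass"].mean(),
})

# Drop every row that contains at least one null
dropped = df_missing.dropna()

print(null_counts)
print(filled)
print(dropped)
```

Only the first row survives `dropna()` here, which shows how quickly the default behavior can shrink a data set.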


The methods above cover a lot of ground, so let's work through some examples on a data set. First, we need some data. We'll use the M&Ms data set for this, because it's small and contains both numeric and object (string) data types.

In [2]:
# Read in the data from the website
url_mms = 'https://tinyurl.com/mms-statistics'
df = pd.read_csv(url_mms)

# Look at the dimensions of your data
print(df.shape)

# Look at the first 5 rows of your data
df.head()

(816, 4)


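|   | type | color | diameter | mass |
|:---|:--------------|:-------|:---------|:-----|
| 0 | peanut butter | blue | 16.2 | 2.18 |
| 1 | peanut butter | brown | 16.5 | 2.01 |
| 2 | peanut butter | orange | 15.48 | 1.78 |
| 3 | peanut butter | brown | 16.32 | 1.98 |
| 4 | peanut butter | yellow | 15.59 | 1.62 |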


`df.info()` prints information about a DataFrame including the index dtype and columns, non-null values and memory usage:

In [3]:
# DataFrame information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 816 entries, 0 to 815
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   type      816 non-null    object 
 1   color     816 non-null    object 
 2   diameter  816 non-null    float64
 3   mass      816 non-null    float64
dtypes: float64(2), object(2)
memory usage: 25.6+ KB


We can print summary statistics for the numeric columns with `df.describe()`:

In [4]:
df.describe()

|   | diameter | mass |
|:------|:----------|:---------|
| count | 816.0 | 816.0 |
| mean | 14.171912 | 1.419632 |
| std | 1.220001 | 0.714765 |
| min | 11.23 | 0.72 |
| 25% | 13.22 | 0.86 |
| 50% | 13.6 | 0.92 |
| 75% | 15.3 | 1.93 |
| max | 17.88 | 3.62 |


`df.columns` prints the column labels of the DataFrame.

In [5]:
df.columns

Index(['type', 'color', 'diameter', 'mass'], dtype='object')

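Say we don't need a column anymore. We can drop it with `df.drop()`. Note that this returns a new DataFrame and leaves `df` itself unchanged unless we assign the result back.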

In [6]:
# Drop the mass column
df.drop(columns='mass').head()

|   | type | color | diameter |
|:---|:--------------|:-------|:---------|
| 0 | peanut butter | blue | 16.2 |
| 1 | peanut butter | brown | 16.5 |
| 2 | peanut butter | orange | 15.48 |
| 3 | peanut butter | brown | 16.32 |
| 4 | peanut butter | yellow | 15.59 |


Lastly, say we want to see all the different values in a certain column and how often each one occurs. We can do so with `df['column_name'].value_counts()`:

In [7]:
# Count the values in the 'type' column
df['type'].value_counts()

plain            462
peanut butter    201
peanut           153
Name: type, dtype: int64
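If proportions are more useful than raw counts, `value_counts()` accepts `normalize=True`. Here is a minimal sketch that rebuilds a Series from the three counts shown above instead of downloading the data again. The variable names `types` and `proportions` are made up for this example.

```python
import pandas as pd

# Rebuild the 'type' column from the counts reported above
types = pd.Series(
    ["plain"] * 462 + ["peanut butter"] * 201 + ["peanut"] * 153,
    name="type",
)

# normalize=True returns the fraction of rows for each value instead of the count
proportions = types.value_counts(normalize=True)

print(proportions.round(3))
```

The fractions sum to 1, which makes it easy to compare categories across data sets of different sizes.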

### Fin

I never remember all of these techniques, so I either google what I need or head over to the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) to refresh my memory.

> Note: This post is a reflection of my first week at Lambda School. For a more rigorous introduction to Python, pandas, and EDA, you can reference [Python for Data Analysis](https://www.amazon.com/Python-Data-Analysis-Wrangling-IPython/dp/1491957662/ref=pd_lpo_14_t_0/137-0388884-4997641?_encoding=UTF8&pd_rd_i=1491957662&pd_rd_r=f83c895f-1fc3-4edd-b74e-723ab0f64044&pd_rd_w=mcYQh&pd_rd_wg=nIRhM&pf_rd_p=337be819-13af-4fb9-8b3e-a5291c097ebb&pf_rd_r=44MJ6GR5N2YKKFSYHP26&psc=1&refRID=44MJ6GR5N2YKKFSYHP26).