## Cleaning Data

### Common methods to look at data

    df.head()      # look at first five rows
    df.tail()      # look at last five rows
    df.shape       # look at shape of data: (rows, columns)
    df.columns     # returns index of column names
    df.info()      # prints column dtypes, non-null counts, and memory usage
    
    Note on dtypes: object columns usually hold strings, int columns hold integers, float columns hold numbers with a decimal part
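
To make the inspection methods above concrete, here is a minimal sketch on a small made-up table. The column names and values are hypothetical and only serve to give each method something to report.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F"],
    "continent": ["Asia", "Asia", "Europe", "Europe", "Africa", None],
    "population": [1_200_000, 350_000, 4_500_000, np.nan, 980_000, 2_100_000],
    "year": [2000, 2000, 2001, 2001, 2002, 2002],
})

first_rows = df.head()            # first five rows by default
last_rows = df.tail(2)            # pass n to change how many rows are shown
n_rows, n_cols = df.shape         # an attribute, not a method: no parentheses
column_names = list(df.columns)   # the column Index converted to a plain list
column_types = df.dtypes          # one dtype per column

df.info()                         # prints a summary and returns None
```

Comparing the non-null counts printed by `df.info()` with `n_rows` is a quick way to spot columns that contain missing values.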

### Exploratory Data Analysis

    df.column_name.value_counts(dropna=False) # frequency count for one column; dropna=False also counts missing values
    df['column_name'].value_counts(dropna=False).head() # bracket syntax also works; .head() shows the five most frequent values
    
    df.describe() # for summary statistics of numerical data
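
A small sketch of the difference `dropna=False` makes, using hypothetical data with one missing continent and one missing population:

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", np.nan, "Asia", "Europe"],
    "population": [10.0, 20.0, 30.0, 40.0, np.nan, 60.0],
})

counts_all = df["continent"].value_counts(dropna=False)  # includes a NaN bucket
counts_known = df["continent"].value_counts()            # default: missing values dropped

summary = df.describe()  # count, mean, std, min, quartiles, max for numeric columns
print(counts_all)
print(summary)
```

If the two counts do not add up to the same total, the column has missing values. The `count` row of `describe()` likewise reports non-missing values only.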

### Visual EDA

    Use bar plots for discrete data counts
    Use histograms for continuous data counts
    
    Histogram
        import matplotlib.pyplot as plt
        df.column_name.plot(kind='hist')
        plt.show()
    
    Identifying suspicious values
        df[df.population > 1000000000] # shows the rows where population exceeds 1 billion
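
The filter above is a boolean mask. A minimal sketch with hypothetical numbers, where one row deliberately carries a data-entry error:

```python
import pandas as pd

# Hypothetical values; row 'B' has extra zeros on purpose.
df = pd.DataFrame({
    "country": ["A", "B", "C"],
    "population": [5_000_000, 72_000_000_000, 1_300_000_000],
})

mask = df["population"] > 1_000_000_000  # True/False for each row
suspects = df[mask]                      # keeps only the rows where mask is True
print(suspects)
```

Not every row that passes the filter is wrong: here `C` could be a genuinely large country, while `B` is implausible. The filter narrows down what to check by hand.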
    
    Box Plots
        - good for visualizing basic summary statistics
        - outliers, min/max, 25th/50th/75th percentiles
        
        df.boxplot(column='population', by='continent') # draws one box of population values per continent
        plt.show()
        
        - whisker ends show the min/max of the data excluding outliers; outliers are drawn as points beyond the whiskers
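
The numbers a box plot draws can also be computed directly. The sketch below assumes the common 1.5 × IQR rule for deciding what counts as an outlier, which is matplotlib's default whisker setting. The data is hypothetical.

```python
import pandas as pd

# Hypothetical values with one obvious outlier.
population = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], name="population")

q1 = population.quantile(0.25)
median = population.quantile(0.50)
q3 = population.quantile(0.75)
iqr = q3 - q1  # interquartile range: the height of the box

# Assumed rule: points more than 1.5 * IQR outside the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = population[(population >= lower_fence) & (population <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()  # where the whiskers end
outliers = population[(population < lower_fence) | (population > upper_fence)]
print(q1, median, q3, whisker_low, whisker_high, list(outliers))
```

The whiskers stop at the most extreme data points that still fall inside the fences, not at the fences themselves.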
    
    Scatter Plots 
        - relationship between 2 numeric variables
        - flag potentially bad data, errors that may not have been found by looking at 1 variable
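
A minimal scatter plot sketch. The columns `height_cm` and `weight_kg` are hypothetical, and the last row is constructed so that each value looks plausible on its own while the pair stands apart from the rest.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: the last point is within the range of both columns,
# yet it does not follow the trend of the other points.
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190, 152],
    "weight_kg": [50, 58, 68, 80, 90, 88],
})

ax = df.plot(kind="scatter", x="height_cm", y="weight_kg")
plt.savefig("scatter.png")  # in an interactive session, plt.show() works instead
```

A histogram of either column alone would not single out the last row; it only stands out when the two variables are plotted against each other.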

## Tidy Data

### "Tidy Data" paper by Hadley Wickham, PhD

    Formalize the way we describe the shape of data
    Gives us a goal when formatting our data
    "Standard way to organize data values within a dataset"
    
    Principles of Tidy Data
        - columns represent separate variables
        - rows represent individual observations
        - observational units form tables
    
    Converting to Tidy Data
        - problem to fix: column headers contain values instead of variable names
        - solution: pd.melt()
        
        pd.melt(frame=df, id_vars='name', value_vars=['treatment a', 'treatment b'], var_name='treatment', value_name='result')
        # to melt data, first specify the dataframe (frame) and the column or columns to hold constant (id_vars)
        # here the column that stores the names of people stays fixed
        # value_vars specifies the columns to melt; if omitted, all columns not listed in id_vars are melted
        # var_name & value_name name the two new columns that melting creates
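
A runnable sketch of the `pd.melt()` call above. The people and results are hypothetical; only the column names follow the call in these notes.

```python
import pandas as pd

# Hypothetical untidy table: the treatment labels sit in the column headers.
df = pd.DataFrame({
    "name": ["Person 1", "Person 2", "Person 3"],
    "treatment a": [None, 12, 24],
    "treatment b": [42, 31, 27],
})

tidy = pd.melt(
    frame=df,
    id_vars="name",                             # stays fixed, repeated for each melted row
    value_vars=["treatment a", "treatment b"],  # these headers become values
    var_name="treatment",                       # name for the column of former headers
    value_name="result",                        # name for the column of former cell values
)
print(tidy)
```

Three rows and two treatment columns become six rows, one per person and treatment, so each row is now a single observation.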
    
### Reshaping data using Melt

    Melting turns columns of data into rows; the 2 key parameters:
        - id_vars = columns to keep as they are (not melted)
        - value_vars = columns to melt into rows
        - by default, if value_vars is not provided, all columns not listed in id_vars are melted
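
The default behaviour of `value_vars` can be checked with a short sketch on hypothetical data: leaving it out gives the same result as listing every column that is not in `id_vars`.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "name": ["P1", "P2"],
    "treatment a": [12, 24],
    "treatment b": [31, 27],
})

melted_default = pd.melt(df, id_vars=["name"])  # value_vars omitted
melted_explicit = pd.melt(
    df, id_vars=["name"], value_vars=["treatment a", "treatment b"]
)
print(melted_default)
```

When `var_name` and `value_name` are not given, pandas names the new columns `variable` and `value`.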

### Pivoting: unmelting data

    - opposite of melting
    - turns the unique values of a variable into separate columns
    - sometimes needed to turn an analysis-friendly shape into a reporting-friendly shape
    - or when a dataset violates tidy data principles, such as when multiple variables are stored in the same column
    
    weather_tidy = weather.pivot(index='date', columns='element', values='value')
    
        - index is the column to keep fixed; columns supplies the new column names; values fills the new cells
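
A runnable sketch of the pivot call. The `weather` table is hypothetical; only the column names `date`, `element`, and `value` follow the call in these notes.

```python
import pandas as pd

# Hypothetical long-format data: 'element' mixes two variables (tmax and tmin).
weather = pd.DataFrame({
    "date": ["2010-01-30", "2010-01-30", "2010-02-02", "2010-02-02"],
    "element": ["tmax", "tmin", "tmax", "tmin"],
    "value": [27.8, 14.5, 27.3, 14.4],
})

weather_tidy = weather.pivot(index="date", columns="element", values="value")

# After pivoting, 'date' is the index. reset_index() turns it back into a column.
weather_flat = weather_tidy.reset_index()
weather_flat.columns.name = None  # drop the leftover 'element' label on the column axis
print(weather_flat)
```

`pivot` expects each `index`/`columns` pair to appear only once. If the same date and element show up twice, it raises an error instead of guessing which value to keep.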