# Intro to Data Science

[Gina Sprint](https://ginasprint.com/)


# Data Processing
What are our learning objectives for this lesson?
* Clean data by filling missing values
* Perform data aggregation w/split-apply-combine

Content used in this lesson is based upon information in the following sources:
* Dr. Shawn Bowers' Data Mining notes

## Warm up Task(s)
Open our main.py from last class: https://github.com/gsprint23/ZIME-Intro-to-Data-Science/blob/master/PandasFun You'll see that I posted the table practice problem solutions from last class' TODO in there too.
1. Read the documentation on Pandas' `groupby()`: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html
1. Add code to main.py to get only the Large group `DataFrame` and print it out
1. Concept question: for any groupby, how do you know how many group `DataFrame` objects there are?
    1. Challenge: how could you find this out programmatically? (lots of ways to solve this actually)

## Today
1. Questions on project part 1?
1. More in PandasFun (`append()`, `value_counts()`, `sort_values()`, Boolean indexing, etc.)
1. Break
1. DataCleaningFun

## TODO
1. Pandas practice problems below (not graded)
1. Work on Quiz 3 in Moodle

## Pandas Practice Problems
1. Create a CSV file called basketball.csv with the following data:
```
NAME,POS,GP,MIN,PTS
Andrew Nembhard,G,25,760,272
Anton Watson,F,25,448,206
Ben Gregg,F,16,104,40
Chet Holmgren,C,25,659,361
Drew Timme,F,25,675,450
Hunter Sallis,G,25,372,119
Julian Strawther,G,25,647,308
Kaden Perry,F,8,53,14
Martynas Arlauskas,G,15,53,12
Matthew Lang,G,16,52,17
Nolan Hickman,G,25,474,156
Rasir Bolton,G,25,654,271
Will Graves,G,15,28,11
```
1. Load this CSV file into a Pandas DataFrame
1. Add a new row for Joe Few. Example code:
```
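new_row = pd.Series(["G", 12, 20, 1], index=df.columns)
new_row.name = "Joe Few"
# DataFrame.append() was removed in pandas 2.0; pd.concat() is the replacement
df = pd.concat([df, new_row.to_frame().T]).infer_objects()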
```
1. Add a new column for CLASS (e.g. freshman, sophomore, junior, or senior). Here are the values:
`["Sr", "Jr", "Fr", "Fr", "Jr", "Fr", "So", "Fr", "Jr", "Sr", "Fr", "Sr", "Sr", "Fr"]`
1. Create a Pandas Series with the count of each CLASS. E.g. for the modified dataset with Joe Few:
```
Fr    6
Sr    4
Jr    3
So    1
Name: Class, dtype: int64
```
1. Apply split-apply-combine to the PTS data
    * Split on CLASS
    * Apply the mean to each group's PTS values
    * Combine the mean PTS values
1. Write code to print out who has the most points per game. Check your work with the ESPN web page (at the time of this writing):

![](https://raw.githubusercontent.com/GonzagaCPSC222/U5-Visualizing-Data/master/figures/team_leaders.png)

## Data Processing
Data analysts spend a surprising amount of time preparing data for analysis. In fact, a survey found that cleaning big data is the most time-consuming and least enjoyable task data scientists do!
<img src="https://thumbor.forbes.com/thumbor/960x0/https%3A%2F%2Fblogs-images.forbes.com%2Fgilpress%2Ffiles%2F2016%2F03%2FTime-1200x511.jpg" width="700">
(image from [https://thumbor.forbes.com/thumbor/960x0/https%3A%2F%2Fblogs-images.forbes.com%2Fgilpress%2Ffiles%2F2016%2F03%2FTime-1200x511.jpg](https://thumbor.forbes.com/thumbor/960x0/https%3A%2F%2Fblogs-images.forbes.com%2Fgilpress%2Ffiles%2F2016%2F03%2FTime-1200x511.jpg))

The goal of data preprocessing is to produce high-quality data to improve mining results and efficiency.

At a high level, data preprocessing includes the following steps (these steps can be done in any order and are often repeated multiple times):
1. Data Exploration (basic understanding of meaning, attributes, values, issues)
2. Data Reduction (reduce size via aggregation, redundant features, etc.)
3. Data Integration (join/merge/combine multiple datasets)
4. Data Cleaning (remove noise and inconsistencies)
    * Dealing with missing values
    * Dealing with incorrect values (e.g., misspelled names, values out of range)
5. Data Transformation (normalize/scale, to discrete values, etc.)
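
As a rough illustration of how some of these steps might look in Pandas, here is a minimal sketch. The two small tables and their column names (`id`, `height_cm`, `height_in`, `weight_kg`) are made up for this example and are not part of the lesson's datasets.

```python
import pandas as pd

# hypothetical toy tables (not from the lesson's datasets)
heights = pd.DataFrame({"id": [1, 2, 3],
                        "height_cm": [150.0, 160.0, 170.0],
                        "height_in": [59.1, 63.0, 66.9]})
weights = pd.DataFrame({"id": [1, 2, 3],
                        "weight_kg": [50.0, 60.0, 80.0]})

# Data Exploration: summary statistics for each numeric column
print(heights.describe())

# Data Reduction: height_in is redundant with height_cm, so drop it
heights = heights.drop(columns="height_in")

# Data Integration: join the two tables on their shared id column
people = heights.merge(weights, on="id")

# Data Transformation: min-max normalize weight_kg to the range [0, 1]
w = people["weight_kg"]
people["weight_scaled"] = (w - w.min()) / (w.max() - w.min())
print(people)
```

Data cleaning is left out of this sketch because it is the focus of the next section.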

It is important for data mining that your process is transparent and repeatable:
* Can repeat "experiment" and get the same result
* No "magic" steps

It is also important to write down your steps (keep a log):
* Ideally, someone should be able to take your data, program, and description of steps, rerun everything, and get the same results!
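
One common source of "magic" is randomness. A minimal sketch of making a random step repeatable, assuming NumPy's `default_rng()` generator; the seed value and the `run_experiment()` helper are hypothetical names used only for illustration:

```python
import numpy as np

SEED = 0  # hypothetical seed value; record it in your log


def run_experiment(seed):
    """Draw 5 random values with a seeded generator and return their mean."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(5).mean()


first = run_experiment(SEED)
second = run_experiment(SEED)
# the same seed produces the same data, so the result repeats exactly
print(first == second)
```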

## Data Cleaning
It is not uncommon to have datasets with noisy, invalid, or completely missing values.
1. Noisy vs Invalid Values
    * Noisy implies the value is valid for its domain, but was recorded incorrectly
        * E.g., decimal place error (5.72 instead of 57.2), wrong categorical value used
    * Invalid implies a noisy value that is not valid for its domain
        * E.g., 57.2X, misspelled categorical data, or value out of range (6 on a 5 point scale)
    * Ways to deal with this:
        * Look for duplicates (when there shouldn't be)
        * Look for outliers
        * Sort and print range of values
    * The term "noisy" may also imply random error or random variance
        * Various techniques to "smooth out" values
        * E.g., using means of bins or regression
2. Missing Values
    * How should we deal with missing values?
        * Discard instances: throw out any row with a missing value
        * Replace with a new value:
            * By hand
            * Use a constant
            * Use a central tendency measure (mean, median, most frequent, ...)
        * Most "probable" value (e.g., regression, using a classifier)
        * Replace either across data set, or based on similar instances
            * E.g. average based on model year
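
The checks for noisy and invalid values listed above can be sketched in Pandas as follows. The survey data, its column names, and the 1 to 5 rating scale are assumptions made up for this example.

```python
import pandas as pd

# hypothetical survey responses on a 1-5 scale
df = pd.DataFrame({"name": ["ann", "bob", "bob", "cam", "dee"],
                   "rating": [4, 3, 3, 6, 5]})

# look for duplicate rows (when there shouldn't be any)
dups = df[df.duplicated()]
print(dups)

# sort and print the range of values
print(df["rating"].sort_values().tolist())
print(df["rating"].min(), df["rating"].max())

# flag values outside the valid domain (here, 1 through 5 inclusive)
invalid = df[~df["rating"].between(1, 5)]
print(invalid)
```

Whether a flagged row is truly an error is still a judgment call; the code only points you to the rows worth inspecting.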
            
Missing values are usually coded as an out-of-range value, such as an empty string in a text field, -1 in a numeric field that is normally positive, or 0 in a numeric field that cannot take on the value of 0. In the SciPy ecosystem, the common value `NaN` (not a number) is used to denote missing data. There is support in the SciPy libraries to handle `NaN` specially. For example, the Pandas function [`isnull()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.isnull.html) returns a Boolean array detecting the `NaN` values element-wise and [`dropna()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) removes `NaN` values from a series or data frame:


In [None]:
## Data Cleaning Example
import numpy as np
import pandas as pd
x = np.arange(0, 10)
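# use a float dtype so the series can hold NaN without an implicit upcast
ser = pd.Series(x, dtype=float)
ser[1] = np.nan  # the np.NaN alias was removed in NumPy 2.0
ser[5] = np.nan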
nans = ser.isnull()
# count the number of missing values
print(nans.sum())
print(ser)
ser.dropna(inplace=True)
print(ser)

Note: you can learn more about missing data by reading the [Pandas documentation on missing data](https://pandas.pydata.org/pandas-docs/stable/missing_data.html).
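
Beyond dropping rows, the replacement strategies listed earlier can be sketched with `replace()` and `fillna()`. This is a minimal sketch; the car data, the column names, and the use of -1 as a missing-value code are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# hypothetical car data; -1 is used as a "missing" code for mpg
cars = pd.DataFrame({"model_year": [75, 75, 75, 76, 76, 76],
                     "mpg": [20.0, -1, 30.0, 18.0, 22.0, -1]})

# recode the out-of-range sentinel as NaN so Pandas treats it as missing
cars["mpg"] = cars["mpg"].replace(-1, np.nan)

# option 1: discard instances with a missing value
dropped = cars.dropna()

# option 2: replace with a constant
constant_filled = cars["mpg"].fillna(0)

# option 3: replace with a central tendency measure across the data set
mean_filled = cars["mpg"].fillna(cars["mpg"].mean())

# option 4: replace based on similar instances (mean of the same model year)
year_means = cars.groupby("model_year")["mpg"].transform("mean")
group_filled = cars["mpg"].fillna(year_means)
print(group_filled)
```

Which option is appropriate depends on the data and the analysis; filling with a group mean keeps more rows than dropping, but it also hides the fact that the value was never measured.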

By learning how to use the Pandas library, we have the skills to perform many of the tasks listed above. In this lesson, we are going to focus on *data cleaning*, modifying the data to make it sufficiently accurate and structured to support the analysis you want to perform. To learn about data cleaning, we are going to clean data by working through an example!

## Data Aggregation
Gathering and summarizing data, perhaps in preparation for statistical analysis or visualization, is called *data aggregation*. For example, suppose you want to investigate the similarities/differences amongst patients in a clinical setting. Suppose specific attributes you are interested in include medical condition, age, and gender. You might *group* the data into two groups by gender: male and female. The grouping allows you to then compute statistics such as the mean and standard deviation for each group, perform hypothesis testing to see if there is a significant age difference between the two groups, or perhaps create a bar chart representing the frequency of each medical condition present in each group. 
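
A minimal sketch of the patient example in Pandas. The records, values, and column names below are invented for illustration only.

```python
import pandas as pd

# hypothetical patient records
patients = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M"],
    "age": [30, 40, 50, 60, 70, 80],
    "condition": ["asthma", "diabetes", "asthma",
                  "asthma", "diabetes", "diabetes"]})

# mean and standard deviation of age for each gender group
age_stats = patients.groupby("gender")["age"].agg(["mean", "std"])
print(age_stats)

# frequency of each medical condition within each gender group
condition_counts = patients.groupby("gender")["condition"].value_counts()
print(condition_counts)
```

The `condition_counts` series is the kind of result you could pass to a plotting library to draw the bar chart described above.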

## Split-Apply-Combine
Data aggregation typically involves a "split, apply, combine" process:
* Split the data into groups based on some criteria
    * Perform *group by* operations
    * Select or slice data to form a subset
    * Example: Group a data frame by rows (axis 0) or by columns (axis 1)
* Apply a function to each group independently, producing a new value
    * Compute summary statistics (aggregation)
        * Example: Count the size of each group
        * Example: Compute mean, standard deviation, custom stats, etc.
    * Transform the data in the group (transformation)
        * Example: Standardizing data (z-score) within each group
        * Example: Filling missing data with a value derived from each group
    * Discard some groups (filtration)
        * Example: Discarding data that belongs to groups with only a few members
        * Example: Filtering out data based on the group sum or mean
* Combine the results of the function applications into a data structure
    * Example: A series with index corresponding to data frame column names and values representing the column means
    
<img src="https://miro.medium.com/max/1400/1*w2oGdXv5btEMxAkAsz8fbg.png" width="500">

(image from [https://miro.medium.com/max/1400/1*w2oGdXv5btEMxAkAsz8fbg.png](https://miro.medium.com/max/1400/1*w2oGdXv5btEMxAkAsz8fbg.png))
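
The three kinds of "apply" step (aggregation, transformation, filtration) can be sketched with a Pandas `GroupBy` object. The data frame below is hypothetical, and the lambda functions are just one possible way to express each step.

```python
import pandas as pd

# hypothetical data with three groups of different sizes
df = pd.DataFrame({"group": ["a", "a", "a", "b", "b", "c"],
                   "value": [1.0, 2.0, 3.0, 10.0, 20.0, 100.0]})
grouped = df.groupby("group")

# aggregation: one summary value per group
sizes = grouped.size()
means = grouped["value"].mean()

# transformation: z-score within each group (result has one value per row)
# note: a group with a single member has an undefined standard deviation,
# so its z-score comes out as NaN
zscores = grouped["value"].transform(lambda s: (s - s.mean()) / s.std())

# filtration: discard rows that belong to groups with fewer than 2 members
kept = grouped.filter(lambda g: len(g) >= 2)

print(sizes, means, zscores, kept, sep="\n\n")
```

Notice the shapes of the results: aggregation returns one value per group, transformation returns one value per original row, and filtration returns a subset of the original rows.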

### Group By
In the split step, we want to divide a dataset into a mapping of group names to group data. Typically, to create the groups, you'll perform a "group by" operation. A group by operation partitions rows (or attribute values) according to the value of another attribute.

For example, you might have a table like the following:

CarName |ModelYear |MSRP
-|-|-
ford pinto |75 |2769
toyota corolla |75 |2711
ford pinto |76|3025
toyota corolla|77 |2789

Let's group the rows by ModelYear, meaning putting all the cars from the year 1975 in one list, all of the cars from the year 1976 in one list, etc. This would create the following partitions (sub tables):

CarName |ModelYear |MSRP
-|-|-
ford pinto |75 |2769
toyota corolla |75 |2711


CarName |ModelYear |MSRP
-|-|-
ford pinto |76|3025


CarName |ModelYear |MSRP
-|-|-
toyota corolla|77 |2789


Then extract the MSRP from each list to get a set of MSRP series, one for each year. Then you could visualize the data with model year on the x-axis, MSRP on the y-axis, and one box-and-whisker plot for each model year.
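
The partitioning of the example table can be reproduced in Pandas. A minimal sketch, using the table's own column names and values:

```python
import pandas as pd

# the example table from above
cars = pd.DataFrame({"CarName": ["ford pinto", "toyota corolla",
                                 "ford pinto", "toyota corolla"],
                     "ModelYear": [75, 75, 76, 77],
                     "MSRP": [2769, 2711, 3025, 2789]})

# split: one sub table per distinct ModelYear value
for year, sub_table in cars.groupby("ModelYear"):
    print(year)
    print(sub_table)

# apply + combine: mean MSRP for each model year, collected in one Series
mean_msrp = cars.groupby("ModelYear")["MSRP"].mean()
print(mean_msrp)
```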

With the Pandas [`groupby()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) function, we can divide a data frame into a [`GroupBy`](http://pandas.pydata.org/pandas-docs/stable/groupby.html) object that stores the mapping. For example:

In [1]:
import numpy as np
import pandas as pd

# adapted from http://pandas.pydata.org/pandas-docs/stable/groupby.html
df = pd.DataFrame({"Gender" : ["F", "F", "M", "F", "M", "M", "M", "F"],
                   "AgeGroup" : ["OA", "A", "OA", "YA", "YA", "OA", "A", "YA"], # OA: older adult, A: adult, YA: young adult
                   "Feature1" : np.random.randn(8),
                   "Feature2" : np.random.randn(8)})
print(df)
# GroupBy object (mapping of group name -> group data frame)
gender_groups = df.groupby("Gender")
# groups attribute is a dictionary storing the mapping
print("Groups:", gender_groups.groups)
print("Female data frame")
F_df = gender_groups.get_group("F")
print(F_df)
print("Male data frame")
M_df = gender_groups.get_group("M")
print(M_df)
# confirm M_df is a data frame
print(type(M_df))
# divided the data frame into 2 groups
print(len(df) == len(F_df) + len(M_df))

  Gender AgeGroup  Feature1  Feature2
0      F       OA -0.411167 -1.794967
1      F        A  1.086750  0.089496
2      M       OA  0.010828 -1.840485
3      F       YA  1.127685  2.047733
4      M       YA  0.793994  0.758327
5      M       OA  1.967862 -0.099171
6      M        A  1.106403  0.323073
7      F       YA  0.256400 -0.606352
Groups: {'F': Int64Index([0, 1, 3, 7], dtype='int64'), 'M': Int64Index([2, 4, 5, 6], dtype='int64')}
Female data frame
  Gender AgeGroup  Feature1  Feature2
0      F       OA -0.411167 -1.794967
1      F        A  1.086750  0.089496
3      F       YA  1.127685  2.047733
7      F       YA  0.256400 -0.606352
Male data frame
  Gender AgeGroup  Feature1  Feature2
2      M       OA  0.010828 -1.840485
4      M       YA  0.793994  0.758327
5      M       OA  1.967862 -0.099171
6      M        A  1.106403  0.323073
<class 'pandas.core.frame.DataFrame'>
True
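
The number of groups equals the number of distinct values in the grouping column. A short sketch of a few ways to check this, using a small fixed data frame (hypothetical values) and grouping by `AgeGroup` instead of `Gender`:

```python
import pandas as pd

# small fixed data frame (hypothetical values) so the results are repeatable
df = pd.DataFrame({"Gender": ["F", "F", "M", "F", "M"],
                   "AgeGroup": ["OA", "A", "OA", "YA", "YA"]})
age_groups = df.groupby("AgeGroup")

# three equivalent ways to count the groups
print(age_groups.ngroups)
print(df["AgeGroup"].nunique())
print(len(age_groups.groups))

# number of rows in each group
group_sizes = age_groups.size()
print(group_sizes)
```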


Now we have learned enough background information to dive into learning about aggregating data by working through an example!