# 21 Conditioning Data
File(s) needed: Taiwan_CellSurvey_RAW.xlsx, conditioning_example.csv


Conditioning data can include many different operations. We will generally use the term to mean **_getting the data ready for further analysis._** We have already done some of this when working with pandas to read data. We can have some control over which columns and rows are loaded into memory for use in our analysis. We need to go much further than that, however.

It also goes beyond the input validation we did previously because we need to validate data that comes from other sources than just the keyboard. We may also need to modify the data to make it usable or work with specialized data types. And we have already touched on missing data but we often need to handle missing values when we work with data from just about any source.

In this notebook, we will work on the following:
- validating data
- working with categorical variables
- working with dates
- aggregating data
- finding and dealing with missing data

**A reminder here:** we will primarily use pandas for our data conditioning tasks. pandas is built on top of NumPy, so _everything we do in pandas goes through NumPy_. We get the benefits of the pandas functionality over NumPy but we do pay a performance price, which can be a factor of 100x (e.g., see https://penandpants.com/2014/09/05/performance-of-pandas-series-vs-numpy-arrays/ for an example). However, with the speed of today’s technology this is not often important when compared to ease of use. But you should keep in mind for the future that if you need the fastest processing, you should stick with NumPy over pandas. Just understand that you will have to code any pandas functionality you give up.


## Validating data
In this section, validating data means creating a data map and data plan and removing duplicates.

What exactly does data validation tell you? Think back to our input validation examples. Let’s talk about what data validation _doesn’t_ tell us first. 

Data validation _does not tell us_ that
- the data is correct or
- there are no outliers.

What it does is this: **data validation gives us the confidence that we can conduct a successful analysis without data errors.** That’s it! Other tasks involved in data conditioning will help us get the data in a state that we can use to solve the problem we started to solve.

### What is in my data?
Nobody really knows what is in a large database. We only see parts of it at any one time because it may be physically impossible to see all of it at once. We could still manually check the database one piece at a time, right? 

The file **15zpnyagi.csv** we used on the last assignment is 3.59 MB in size and contains 129* columns and 9234 rows of data. How long would it take you to manually review all 1,191,186 cells? Perhaps more importantly, _what is your acceptable error rate?_ There is no way you can review all of the data and not miss something. You might also introduce new data problems. And it is mind-numbingly boring, which is one of the reasons you miss or introduce errors. Python and pandas give you some tools for inspecting your data.

---
`*` Yes, the sheet contains 131 columns but the first two columns just tell you that the data is from New York so I’m not counting them here. The file with the entire US data in it is 164 MB, 131 columns and 166,698 rows! That's 21,837,438 cells to inspect!

### Data map and data plan
A **data map** provides you with an overview of your data. It will give you the “big picture” view that can help you find redundant variables, missing variables, potential errors, and values that need to be transformed. Reviewing these potential data issues leads to a **data plan**, which is the list of tasks that need to be performed to properly condition the data. Remember the whole “solve the problem first” thing?

We will talk about some tools that can help us get a view of our data so we can develop a plan of attack.
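One possible starting point for a data map is a table with one row per column that lists its data type, how many values are present or missing, and how many distinct values it holds. The sketch below is only an illustration: the helper name `build_data_map` and the small sample frame are made up for this example and are not part of the course files.

```python
import pandas as pd

def build_data_map(frame):
    """Return one row per column: dtype, non-null count, null count, unique count."""
    return pd.DataFrame({
        'dtype': frame.dtypes.astype(str),
        'non_null': frame.count(),
        'nulls': frame.isnull().sum(),
        'unique': frame.nunique(),
    })

# Hypothetical sample data, loosely modeled on a student/class table
sample = pd.DataFrame({
    'Student': [101, 101, 102],
    'Dept': ['MIS', 'MGMT', 'MGMT'],
    'Grade': ['A', None, 'C'],
})

data_map = build_data_map(sample)
print(data_map)
```

Columns with many nulls, a single unique value, or an unexpected dtype are natural candidates for the data plan.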

## Duplicate values
Duplicate data will lead to bad results. The duplicated data receives more weight in the analysis than unduplicated data. A couple of duplicated points may not matter much but how do you know only a couple of points are duplicated? Plus, your analysis should always be able to withstand scrutiny and be as reproducible as possible.  So you need to find and remove duplicates.

#### Finding duplicates
Use the DataFrame method `.duplicated()` to find duplicate records.

Let's try it on our example dataset.

In [2]:
# We will need a numpy method later
import numpy as np
import pandas as pd

In [3]:
# Example: finding duplicates
example_df = pd.read_csv("conditioning_example.csv")

# the info() method is one tool for seeing how the data is setup
example_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
Student           7 non-null int64
Dept              7 non-null object
Class             7 non-null int64
Grade             6 non-null object
Date completed    6 non-null object
dtypes: int64(2), object(3)
memory usage: 360.0+ bytes


In [4]:
# Example: Find the duplicates
# print the raw data for comparison
print(example_df)
print()                              # just adds a blank line for better reading

# use duplicated() to find the duplicate values
# it returns a Boolean Series, which we use to keep only the True entries
search = example_df.duplicated()
print(search[search])


   Student  Dept  Class Grade Date completed
0      101   MIS   3335     A      4/28/2018
1      101  MGMT   4347     B            NaN
2      101   MIS   3335     A      4/28/2018
3      102  MGMT   4347     C      4/27/2018
4      102   MIS   3328     A       5/1/2018
5      103  MGMT   4347   NaN      4/28/2018
6      103  QMTH   3335     D       5/3/2018

2    True
dtype: bool


Duplicate values can be removed and a new copy of the data can be saved without them by using the `drop_duplicates()` method of the DataFrame. The following code leaves us with a new set of data without the duplicate record.

In [5]:
# remove duplicates
example_df2 = example_df.drop_duplicates()
print(example_df2)

   Student  Dept  Class Grade Date completed
0      101   MIS   3335     A      4/28/2018
1      101  MGMT   4347     B            NaN
3      102  MGMT   4347     C      4/27/2018
4      102   MIS   3328     A       5/1/2018
5      103  MGMT   4347   NaN      4/28/2018
6      103  QMTH   3335     D       5/3/2018
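By default, `duplicated()` compares every column and flags only the second and later occurrences of a record. The sketch below, which uses a small made-up frame rather than the course file, shows two options that can help when reviewing duplicates: `keep=False` marks every copy, and `subset` restricts the comparison to selected columns.

```python
import pandas as pd

# Made-up records for illustration
records = pd.DataFrame({
    'Student': [101, 101, 101, 102],
    'Dept': ['MIS', 'MGMT', 'MIS', 'MGMT'],
    'Class': [3335, 4347, 3335, 4347],
})

# keep=False marks every member of a duplicate group, not just the repeats
all_copies = records[records.duplicated(keep=False)]
print(all_copies)

# subset limits the comparison to the listed columns
same_course = records.duplicated(subset=['Dept', 'Class'])
print(same_course)
```

Seeing all copies side by side makes it easier to decide whether the rows are true duplicates before dropping anything.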


### Example: Taiwan survey data
Let's try it with a bigger dataset. The file `Taiwan_CellSurvey_RAW.xlsx` contains data from actual survey results conducted on computer and cell phone use in Taiwan. This version of the data contains a subset of the cell phone data.

First, let's read the data and get an idea of what it looks like.

In [6]:
# Example: real survey data
# SurveyID is read as a regular column; the rows keep the default integer index
df = pd.read_excel('Taiwan_CellSurvey_RAW.xlsx')
df.head()

Unnamed: 0,SurveyID,Age,Gender,Type,Education,Income,Employed,FullPart,@1OwnCell,@2UsedCell,...,PBI2,PBI3,PBI4,PBIUse,PBIBuy,PGL1,PGL2,PGL3,Comments,VAR00001
0,S4,,2,1,3,1.0,2.0,,,,...,2,4,1,6.3,2.1,6,4,6,,3
1,S5,,1,1,3,2.0,2.0,,10.0,10.0,...,2,4,2,4.9,2.8,4,3,4,,3
2,S8,,2,1,2,2.0,2.0,,25.0,25.0,...,3,3,1,4.2,2.8,5,5,5,,2
3,S12,,2,1,2,3.0,2.0,,12.0,12.0,...,1,4,1,5.6,1.4,5,5,5,,2
4,S13,,2,1,3,,2.0,,10.0,10.0,...,0,0,0,0.0,0.0,4,4,4,,3


In [7]:
# .info() on this data will give you much more than the previous example!
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 272 entries, 0 to 271
Data columns (total 54 columns):
SurveyID           272 non-null object
Age                257 non-null float64
Gender             272 non-null int64
Type               272 non-null int64
Education          272 non-null int64
Income             263 non-null float64
Employed           271 non-null float64
FullPart           82 non-null float64
@1OwnCell          261 non-null float64
@2UsedCell         261 non-null float64
@3OwnHousePhone    266 non-null float64
@4UseFreq          270 non-null float64
BQ4a               270 non-null float64
BQ4aCont           270 non-null float64
@5UseLength        268 non-null float64
BQ5a               268 non-null float64
BQ5aCont           268 non-null float64
@6aTexting         202 non-null float64
@6bEmail           90 non-null float64
@6cInternet        83 non-null float64
@6dBank            15 non-null float64
@6eBills           14 non-null float64
@6fFacebook        31 non-n

In [8]:
# Find any duplicates
search = df.duplicated()
print(search[search])

52     True
63     True
185    True
186    True
dtype: bool


Look for these rows in the original Excel file before we move on. Don't forget to account for the headers in the Excel file and the zero-based index of the DataFrame.
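As a quick aid for that lookup, the arithmetic can be sketched as follows. It assumes the spreadsheet has exactly one header row in Excel row 1 and that the DataFrame uses the default zero-based index; adjust the offset if your file is laid out differently.

```python
# DataFrame index labels reported by duplicated() in the output above
duplicate_index = [52, 63, 185, 186]

# +1 for the header row, +1 because Excel rows start at 1 instead of 0
excel_rows = [i + 2 for i in duplicate_index]
print(excel_rows)
```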

Once you are satisfied with those results, save a new copy of the DataFrame without the duplicate values.

**If you are intending to save any changes to disk, make sure you work on a _copy_ of the data file and not on the original data file itself. If we have the original data file available we can always start over.**

In [9]:
# remove duplicates
df2 = df.drop_duplicates()
print(df2)

    SurveyID   Age  Gender  Type  Education  Income  Employed  FullPart  \
0         S4   NaN       2     1          3     1.0       2.0       NaN   
1         S5   NaN       1     1          3     2.0       2.0       NaN   
2         S8   NaN       2     1          2     2.0       2.0       NaN   
3        S12   NaN       2     1          2     3.0       2.0       NaN   
4        S13   NaN       2     1          3     NaN       2.0       NaN   
5        S14   NaN       2     1          3     5.0       1.0       1.0   
6        S15   NaN       2     1          2     3.0       1.0       2.0   
7        S16   NaN       2     1          3     NaN       2.0       NaN   
8        S35   NaN       1     1          4     5.0       1.0       1.0   
9        S37   NaN       1     1          4     5.0       1.0       1.0   
10       S43   NaN       2     1          3     2.0       2.0       NaN   
11       S44   NaN       1     1          3     5.0       2.0       NaN   
12       S46   NaN       

## Working with categorical variables
This would be a good place for a review of data types from QMTH 2330.
### Types of variables
- Quantitative: we are studying something that has a numerical value.
	- Numerical as in "representing a numerical value," not just that it is a number.
	- Numbers can be (and often are) categorical data.
- Categorical: we are studying something that only has a descriptive value.

We can make visualizations with both types. The main difference between them is …

**_we can perform calculations on quantitative data but not on categorical data._**

Don’t confuse the data with the characters used to represent that data.
- What is the average age of everyone in the class?
- What is the average eye color?
- In our Taiwan example data, “gender” is coded as either a 1 or 2. 
    - What is the average gender in the dataset? 
    - It doesn’t matter what character is used to represent it, it is still categorical.

A good place in the pandas documentation for more info on working with categorical data:

http://pandas.pydata.org/pandas-docs/stable/categorical.html#categorical

### Converting to categorical
As mentioned, categorical data is often represented by numbers. Look back at our first example with the student and class data. 

What is the data type for the `Class` variable? Is that appropriate?

In [10]:
# Inspect the dataset
example_df2.info()
print(example_df2)


<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 6
Data columns (total 5 columns):
Student           6 non-null int64
Dept              6 non-null object
Class             6 non-null int64
Grade             5 non-null object
Date completed    5 non-null object
dtypes: int64(2), object(3)
memory usage: 288.0+ bytes
   Student  Dept  Class Grade Date completed
0      101   MIS   3335     A      4/28/2018
1      101  MGMT   4347     B            NaN
3      102  MGMT   4347     C      4/27/2018
4      102   MIS   3328     A       5/1/2018
5      103  MGMT   4347   NaN      4/28/2018
6      103  QMTH   3335     D       5/3/2018


The data in the Class column should be changed to make sure it is treated as a categorical variable. We can create a new column as a categorical type for this purpose.

In [11]:
# Example: create new column
example_df2['Class_Cat'] = example_df2['Class'].astype('category')
example_df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 6
Data columns (total 6 columns):
Student           6 non-null int64
Dept              6 non-null object
Class             6 non-null int64
Grade             5 non-null object
Date completed    5 non-null object
Class_Cat         6 non-null category
dtypes: category(1), int64(2), object(3)
memory usage: 398.0+ bytes


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
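The message in the output above is pandas' `SettingWithCopyWarning`. It appears because `example_df2` was derived from another DataFrame, and pandas cannot be sure whether the assignment should also affect the original. One common way to avoid the ambiguity, sketched here on made-up data, is to take an explicit copy before adding columns. Whether the warning appears at all depends on the pandas version in use.

```python
import pandas as pd

# Made-up data for illustration
raw = pd.DataFrame({
    'Student': [101, 101, 101],
    'Class': [3335, 4347, 3335],
})

# .copy() makes the de-duplicated frame an independent object
deduped = raw.drop_duplicates().copy()
deduped['Class_Cat'] = deduped['Class'].astype('category')

print(deduped.dtypes)
print(raw.columns.tolist())   # the original frame is unchanged
```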
  


In this dataset, pandas can infer an appropriate sort order for `Class_Cat`. In other categorical data you might need to specify the sort order. See the pandas categorical documentation for more info.
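As an illustration of specifying the order yourself, letter grades sort alphabetically by default, which puts `A` first even if you want it treated as the highest value. The sketch below uses a made-up Series and `pd.CategoricalDtype` to declare the order explicitly.

```python
import pandas as pd

# Made-up grades for illustration
grades = pd.Series(['B', 'A', 'D', 'C', 'A'])

# List the categories from lowest to highest and mark them as ordered
grade_order = pd.CategoricalDtype(categories=['D', 'C', 'B', 'A'], ordered=True)
grades_cat = grades.astype(grade_order)

# Sorting and min/max now follow the declared order, not the alphabet
print(grades_cat.sort_values().tolist())
print(grades_cat.max())
```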

The `inplace` option changes the order of the data in the DataFrame itself.

In [12]:
# Example: sort by Class_Cat
example_df2.sort_values(inplace=True, by="Class_Cat")
print(example_df2)

   Student  Dept  Class Grade Date completed Class_Cat
4      102   MIS   3328     A       5/1/2018      3328
0      101   MIS   3335     A      4/28/2018      3335
6      103  QMTH   3335     D       5/3/2018      3335
1      101  MGMT   4347     B            NaN      4347
3      102  MGMT   4347     C      4/27/2018      4347
5      103  MGMT   4347   NaN      4/28/2018      4347


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [13]:
# Sort by the index - but it only shows in the print, not the df
print(example_df2.sort_index())
print()
print(example_df2)

   Student  Dept  Class Grade Date completed Class_Cat
0      101   MIS   3335     A      4/28/2018      3335
1      101  MGMT   4347     B            NaN      4347
3      102  MGMT   4347     C      4/27/2018      4347
4      102   MIS   3328     A       5/1/2018      3328
5      103  MGMT   4347   NaN      4/28/2018      4347
6      103  QMTH   3335     D       5/3/2018      3335

   Student  Dept  Class Grade Date completed Class_Cat
4      102   MIS   3328     A       5/1/2018      3328
0      101   MIS   3335     A      4/28/2018      3335
6      103  QMTH   3335     D       5/3/2018      3335
1      101  MGMT   4347     B            NaN      4347
3      102  MGMT   4347     C      4/27/2018      4347
5      103  MGMT   4347   NaN      4/28/2018      4347


In [14]:
# Return the data to the original order
example_df2.sort_index(inplace=True)


In [15]:
## Day 2
# Restart work on Taiwan data dataframe
import numpy as np
import pandas as pd

# read data into memory (SurveyID stays a regular column)
df = pd.read_excel("Taiwan_CellSurvey_RAW.xlsx")

# remove duplicates and save to df2
df2 = df.drop_duplicates()

# verify df looks the way we expect
df2.head()

Unnamed: 0,SurveyID,Age,Gender,Type,Education,Income,Employed,FullPart,@1OwnCell,@2UsedCell,...,PBI2,PBI3,PBI4,PBIUse,PBIBuy,PGL1,PGL2,PGL3,Comments,VAR00001
0,S4,,2,1,3,1.0,2.0,,,,...,2,4,1,6.3,2.1,6,4,6,,3
1,S5,,1,1,3,2.0,2.0,,10.0,10.0,...,2,4,2,4.9,2.8,4,3,4,,3
2,S8,,2,1,2,2.0,2.0,,25.0,25.0,...,3,3,1,4.2,2.8,5,5,5,,2
3,S12,,2,1,2,3.0,2.0,,12.0,12.0,...,1,4,1,5.6,1.4,5,5,5,,2
4,S13,,2,1,3,,2.0,,10.0,10.0,...,0,0,0,0.0,0.0,4,4,4,,3


### Example: Taiwan survey data, part 2
The gender column should be categorical. Use what we've done so far to convert it. Verify that it has been changed.

In [16]:
# Example: convert Taiwan gender column to categorical
# review what our column structure looks like.
print(df2.columns)

Index(['SurveyID', 'Age', 'Gender', 'Type', 'Education', 'Income', 'Employed',
       'FullPart', '@1OwnCell', '@2UsedCell', '@3OwnHousePhone', '@4UseFreq',
       'BQ4a', 'BQ4aCont', '@5UseLength', 'BQ5a', 'BQ5aCont', '@6aTexting',
       '@6bEmail', '@6cInternet', '@6dBank', '@6eBills', '@6fFacebook',
       '@6gPics', '@6hGames', '@6iBuy', '@7PurchOnline', 'BQ7a', 'BQ7aCont',
       '@8CellExper', 'BQ8a', 'BQ8b', '@9Comfort', 'BQ9a', '@10Satisfaction',
       'BQ10a', 'PA1', 'PA2', 'PA3', 'PA4', 'PA5', 'PA6', 'PA7', 'PBI1',
       'PBI2', 'PBI3', 'PBI4', 'PBIUse', 'PBIBuy', 'PGL1', 'PGL2', 'PGL3',
       'Comments', 'VAR00001'],
      dtype='object')


In [17]:
# Can we change the original column name first?
# If we could, we could reuse the original name.
df2.rename(index = str, columns = {'Gender': 'Gender_RAW'}, inplace = True)
df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 268 entries, 0 to 271
Data columns (total 54 columns):
SurveyID           268 non-null object
Age                254 non-null float64
Gender_RAW         268 non-null int64
Type               268 non-null int64
Education          268 non-null int64
Income             259 non-null float64
Employed           267 non-null float64
FullPart           78 non-null float64
@1OwnCell          257 non-null float64
@2UsedCell         257 non-null float64
@3OwnHousePhone    262 non-null float64
@4UseFreq          266 non-null float64
BQ4a               266 non-null float64
BQ4aCont           266 non-null float64
@5UseLength        264 non-null float64
BQ5a               264 non-null float64
BQ5aCont           264 non-null float64
@6aTexting         198 non-null float64
@6bEmail           88 non-null float64
@6cInternet        82 non-null float64
@6dBank            15 non-null float64
@6eBills           14 non-null float64
@6fFacebook        30 non-null f

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  return super(DataFrame, self).rename(**kwargs)


In [18]:
# Copy and convert the int64 column Gender_RAW values to a categorical column named Gender
df2['Gender'] = df2['Gender_RAW'].astype('category')
df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 268 entries, 0 to 271
Data columns (total 55 columns):
SurveyID           268 non-null object
Age                254 non-null float64
Gender_RAW         268 non-null int64
Type               268 non-null int64
Education          268 non-null int64
Income             259 non-null float64
Employed           267 non-null float64
FullPart           78 non-null float64
@1OwnCell          257 non-null float64
@2UsedCell         257 non-null float64
@3OwnHousePhone    262 non-null float64
@4UseFreq          266 non-null float64
BQ4a               266 non-null float64
BQ4aCont           266 non-null float64
@5UseLength        264 non-null float64
BQ5a               264 non-null float64
BQ5aCont           264 non-null float64
@6aTexting         198 non-null float64
@6bEmail           88 non-null float64
@6cInternet        82 non-null float64
@6dBank            15 non-null float64
@6eBills           14 non-null float64
@6fFacebook        30 non-null f

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [19]:
# Check the values for Gender
df2.head(8)

Unnamed: 0,SurveyID,Age,Gender_RAW,Type,Education,Income,Employed,FullPart,@1OwnCell,@2UsedCell,...,PBI3,PBI4,PBIUse,PBIBuy,PGL1,PGL2,PGL3,Comments,VAR00001,Gender
0,S4,,2,1,3,1.0,2.0,,,,...,4,1,6.3,2.1,6,4,6,,3,2
1,S5,,1,1,3,2.0,2.0,,10.0,10.0,...,4,2,4.9,2.8,4,3,4,,3,1
2,S8,,2,1,2,2.0,2.0,,25.0,25.0,...,3,1,4.2,2.8,5,5,5,,2,2
3,S12,,2,1,2,3.0,2.0,,12.0,12.0,...,4,1,5.6,1.4,5,5,5,,2,2
4,S13,,2,1,3,,2.0,,10.0,10.0,...,0,0,0.0,0.0,4,4,4,,3,2
5,S14,,2,1,3,5.0,1.0,1.0,10.0,10.0,...,4,2,6.3,3.5,9,8,8,,3,2
6,S15,,2,1,2,3.0,1.0,2.0,,,...,3,3,4.9,4.2,7,7,7,,2,2
7,S16,,2,1,3,,2.0,,10.0,10.0,...,3,2,4.9,3.5,4,4,4,,3,2


Here is another concern: some algorithms need a variable like `Gender` to be coded with binary data (i.e., only 1 and 0). We need to convert the 1s and 2s in the data to 0s and 1s.

In [32]:
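# Change the category values from 1 & 2 to 0 & 1
# BE VERY CAREFUL WHEN DOING THIS SO YOU DON'T ACCIDENTALLY CHANGE THE MEANING OF THE DATA.
# Run this cell only once: on a second run the categories are already 0 and 1,
# so both would map to 0 and pandas raises the ValueError shown below.
df2['Gender'] = df2['Gender'].cat.rename_categories({2: 1, 1: 0})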


ValueError: Categorical categories must be unique

In [21]:
df2.head()

Unnamed: 0,SurveyID,Age,Gender_RAW,Type,Education,Income,Employed,FullPart,@1OwnCell,@2UsedCell,...,PBI3,PBI4,PBIUse,PBIBuy,PGL1,PGL2,PGL3,Comments,VAR00001,Gender
0,S4,,2,1,3,1.0,2.0,,,,...,4,1,6.3,2.1,6,4,6,,3,1
1,S5,,1,1,3,2.0,2.0,,10.0,10.0,...,4,2,4.9,2.8,4,3,4,,3,0
2,S8,,2,1,2,2.0,2.0,,25.0,25.0,...,3,1,4.2,2.8,5,5,5,,2,1
3,S12,,2,1,2,3.0,2.0,,12.0,12.0,...,4,1,5.6,1.4,5,5,5,,2,1
4,S13,,2,1,3,,2.0,,10.0,10.0,...,0,0,0.0,0.0,4,4,4,,3,1
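An alternative sketch for the same recode, shown on a small made-up frame: build the binary column directly from the raw codes with `map()` and an explicit dictionary, then cross-tabulate old against new to confirm nothing changed meaning. Unlike renaming categories, this can be rerun safely because it always starts from the raw column.

```python
import pandas as pd

# Made-up raw codes for illustration
survey = pd.DataFrame({'Gender_RAW': [2, 1, 2, 2]})

# Explicit old-code -> new-code mapping; any unmapped code would become NaN
recode = {1: 0, 2: 1}
survey['Gender'] = survey['Gender_RAW'].map(recode)

# Cross-tabulate old against new to confirm the recode
check = pd.crosstab(survey['Gender_RAW'], survey['Gender'])
print(check)

# Store the result as a categorical column
survey['Gender'] = survey['Gender'].astype('category')
```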


We may not want the `Gender_RAW` column in our DataFrame anymore since it contains values we can't use. If that is the case, we can use the `drop()` method to remove the column.

In [22]:
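# Example: drop Gender_RAW from the DataFrame
# Column labels are case-sensitive, so the name must match exactly.
# Without inplace = True, drop() returns a new DataFrame and leaves df2 unchanged.
df2_trimmed = df2.drop('Gender_RAW', axis = 1)
#df2_trimmed.head(3)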

Of course, now it would be much more convenient if `Gender` were near the left of the DataFrame. We can move it to any location we want. It is a three-step process, shown below.

In [23]:
# Example: rearrange the columns to move Gender to the left
# First: get current list of columns and save to variable
print(df2.columns)

Index(['SurveyID', 'Age', 'Gender_RAW', 'Type', 'Education', 'Income',
       'Employed', 'FullPart', '@1OwnCell', '@2UsedCell', '@3OwnHousePhone',
       '@4UseFreq', 'BQ4a', 'BQ4aCont', '@5UseLength', 'BQ5a', 'BQ5aCont',
       '@6aTexting', '@6bEmail', '@6cInternet', '@6dBank', '@6eBills',
       '@6fFacebook', '@6gPics', '@6hGames', '@6iBuy', '@7PurchOnline', 'BQ7a',
       'BQ7aCont', '@8CellExper', 'BQ8a', 'BQ8b', '@9Comfort', 'BQ9a',
       '@10Satisfaction', 'BQ10a', 'PA1', 'PA2', 'PA3', 'PA4', 'PA5', 'PA6',
       'PA7', 'PBI1', 'PBI2', 'PBI3', 'PBI4', 'PBIUse', 'PBIBuy', 'PGL1',
       'PGL2', 'PGL3', 'Comments', 'VAR00001', 'Gender'],
      dtype='object')


In [None]:
# Second: copy the list to this variable and reorder them the way you want them.
# Here 'Gender' has been moved from the end of the list to just after 'Age'.
cols2 = ['SurveyID', 'Age', 'Gender', 'Gender_RAW', 'Type', 'Education', 'Income',
       'Employed', 'FullPart', '@1OwnCell', '@2UsedCell', '@3OwnHousePhone',
       '@4UseFreq', 'BQ4a', 'BQ4aCont', '@5UseLength', 'BQ5a', 'BQ5aCont',
       '@6aTexting', '@6bEmail', '@6cInternet', '@6dBank', '@6eBills',
       '@6fFacebook', '@6gPics', '@6hGames', '@6iBuy', '@7PurchOnline', 'BQ7a',
       'BQ7aCont', '@8CellExper', 'BQ8a', 'BQ8b', '@9Comfort', 'BQ9a',
       '@10Satisfaction', 'BQ10a', 'PA1', 'PA2', 'PA3', 'PA4', 'PA5', 'PA6',
       'PA7', 'PBI1', 'PBI2', 'PBI3', 'PBI4', 'PBIUse', 'PBIBuy', 'PGL1',
       'PGL2', 'PGL3', 'Comments', 'VAR00001']

In [None]:
# Finally: apply the new column order and verify
df2 = df2[cols2]
df2.head(3)
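Retyping a long column list by hand is error-prone. As a minimal sketch on a made-up frame, the same reordering can be done programmatically by building the list without the column and inserting it at the position you want.

```python
import pandas as pd

# Made-up frame for illustration
frame = pd.DataFrame({
    'SurveyID': ['S4'],
    'Age': [60.0],
    'Income': [1.0],
    'Gender': [1],
})

# Every column except Gender, then Gender inserted at position 2
cols = [c for c in frame.columns if c != 'Gender']
cols.insert(2, 'Gender')

frame = frame[cols]
print(frame.columns.tolist())
```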

## Working with dates


Dates can be a problem in any dataset. They are stored as numeric values, so correctly interpreting their values depends upon the underlying standard of their source. Did you know Excel uses two different start dates? https://support.microsoft.com/en-us/help/214330/differences-between-the-1900-and-the-1904-date-system-in-excel

There are also cultural differences in how dates are represented. Dates in Europe are written with the day of the month first, then the month and year, so 3/4/2018 means 3 April there but March 4 in the US.
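To make the ambiguity concrete, the sketch below parses the same text twice with `datetime.strptime`, once with a month-first format and once with a day-first format. The date string is made up for the example.

```python
import datetime as dt

text = '03/04/2018'

# Month first (common in the US)
us_date = dt.datetime.strptime(text, '%m/%d/%Y')

# Day first (common in Europe)
eu_date = dt.datetime.strptime(text, '%d/%m/%Y')

print(us_date.date(), eu_date.date())
```

Both calls succeed, which is exactly the problem: nothing in the text itself says which reading is correct, so you have to know the convention used by the source.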

How about time zones? If time and dates are important pieces of your data, you might have to take the time zone into account. There are 4 time zones just across the contiguous 48 United States. Add Alaska, Hawai'i, and US possessions and the number is much larger. What time is it in Guam right now? https://www.timeanddate.com/worldclock/guam

You might need to use UTC (coordinated universal time) or GMT (Greenwich mean time) as a basis to standardize any time data you encounter. Or you might need to adjust formatting for an international audience. For any operations involving dates and times, Python provides the `datetime` module with many useful functions built in.
https://docs.python.org/3/library/datetime.html
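A minimal sketch of standardizing on UTC with the `datetime` module: store the timestamp as a timezone-aware UTC value and convert it only for display. The timestamp is made up, and Guam is modeled here as a fixed UTC+10 offset, which is a simplification that would not work for zones that observe daylight saving time.

```python
import datetime as dt

# A timezone-aware timestamp in UTC
utc_time = dt.datetime(2018, 4, 28, 15, 30, tzinfo=dt.timezone.utc)

# Guam (Chamorro Standard Time) modeled as a fixed 10 hours ahead of UTC
guam = dt.timezone(dt.timedelta(hours=10), 'ChST')
guam_time = utc_time.astimezone(guam)

print(utc_time.isoformat())
print(guam_time.isoformat())
```

The two values print differently but represent the same instant, so comparisons and differences between them remain correct.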


In [None]:
# Of course we need to import the module
import datetime as dt

In [None]:
# Create a datetime object to hold the current date and time
now = dt.datetime.now()
print(now)

That may not be too helpful since it is difficult to read. The `datetime` object includes a method called `strftime` that allows you to specify a formatting string to control the display of the date and time. A list of the formatting options is available in the documentation: https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior

In [None]:
# Print a formatted date using strftime
# US style
print(now.strftime('%a, %B %d %Y'))
# European style
print(now.strftime('%a, %d %B %Y'))

In [None]:
# Example: time transformation

# Shift the time to 2 hours from now 
timevalue = now + dt.timedelta(hours = 2)

# Print both times, then the difference between them
print(now.strftime('%H:%M:%S'))
print(timevalue.strftime('%H:%M:%S'))
print(timevalue - now)

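`timedelta` is a class in the `datetime` module, so `dt.timedelta(hours = 2)` creates a `timedelta` object. Its constructor also accepts weeks, days, minutes, seconds, milliseconds, and microseconds. It can be used to make adjustments to times and dates.

Subtracting one `datetime` from another also produces a `timedelta` object, and pandas offers a similar `Timedelta` type that provides much of the same functionality.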
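The same idea carries over to date columns in a DataFrame. The sketch below uses made-up values in the style of the `Date completed` column: `pd.to_datetime` parses the text into real dates, and subtracting dates yields time differences.

```python
import pandas as pd

# Made-up date strings in month/day/year form, with one missing value
completed = pd.Series(['4/28/2018', '5/1/2018', None])

# Parse the text; missing entries become NaT (the datetime version of NaN)
dates = pd.to_datetime(completed, format='%m/%d/%Y')

# Subtracting datetimes gives timedeltas; .dt.days extracts whole days
days_after_first = (dates - dates.min()).dt.days
print(days_after_first)
```

Stating the `format` explicitly avoids the day-first versus month-first guessing discussed earlier.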

## Data aggregation
A big part of working with large datasets is looking at summarizations of the data. You know from statistics that we describe data with quantities like mean, median, min, and max. With these quantities, we get some insight into the data from a single number.

When we use these aggregating functions with a DataFrame we get a value for each column with numeric data.

| pandas Aggregation Method | Result |
| :--- | :--- |
| `count()` | Total number of items |
| `first()`, `last()` | First and last item |
| `sum()` | Sum of all items |
| `mean()`, `median()` | Mean and median |
| `min()`, `max()` | Minimum and maximum items |
| `std()`, `var()` | Standard deviation and variance |


In [None]:
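# Example: aggregation with DataFrame
# numeric_only = True skips text and categorical columns such as SurveyID and Gender
print(df2.mean(numeric_only = True))
print()
print(df2.median(numeric_only = True))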

We can use the methods on individual columns as well.

In [None]:
# Example: find the mean of the Age column
df2['Age'].mean()

Some of these aggregations are included in the `describe()` method. This is another good tool to learn about your data.

In [None]:
# Example: describe() method
df2.describe()

Where are `SurveyID` and `Comments`? The default behavior for `describe()` is to only list the numeric variables. We saw in the output from the `info()` method that they were of the _object_ data type. We have to either specifically request they be printed (using the `include` option) or ask that the numbers _not_ be printed (using `exclude`).

In [None]:
# Example: show nonnumbers in describe()
df2.info()
print(df2.describe(include = [object]))
print()
print(df2.describe(exclude = [np.number]))

And we can also get appropriate info on individual columns.

In [None]:
# Example: individual non-numeric column
df2['SurveyID'].count()


In [None]:
# What happens if we try to obtain an inappropriate aggregation?
df2['SurveyID'].mean()   # this will raise an error because SurveyID holds text

### Conditional aggregation: the GroupBy object
Those basic aggregation methods can give us quite a bit of insight into our data. Sometimes we want to see some of those aggregations based upon some sort of grouping of the data. pandas implements a `groupby` method of the DataFrame to do just that.

http://pandas.pydata.org/pandas-docs/stable/groupby.html

If the name 'groupby' sounds familiar, it was adopted from the same command in the SQL language.

Look familiar?
```
SELECT Column1, Column2, AVG(Column3), SUM(Column4)
FROM SomeTable
GROUP BY Column1, Column2
```

In SQL and in pandas, `groupby` allows you to apply aggregation to selected subsets of the data. 
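As a rough pandas counterpart to the SQL statement above, the sketch below groups a made-up table by two columns and computes a mean and a sum. The table contents and the output column names `mean_col3` and `sum_col4` are invented for illustration.

```python
import pandas as pd

# Made-up table mirroring the column names in the SQL example
some_table = pd.DataFrame({
    'Column1': ['a', 'a', 'b', 'b'],
    'Column2': ['x', 'x', 'y', 'y'],
    'Column3': [1.0, 3.0, 5.0, 7.0],
    'Column4': [10, 20, 30, 40],
})

# Named aggregation: output_column=(input_column, function_name)
summary = (
    some_table
    .groupby(['Column1', 'Column2'])
    .agg(mean_col3=('Column3', 'mean'), sum_col4=('Column4', 'sum'))
    .reset_index()
)
print(summary)
```

`reset_index()` turns the group keys back into ordinary columns, which makes the result look more like the rows a SQL query would return.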

In [None]:
# Example: count all column values by gender
df2.groupby('Gender').count()

That last statement creates a `groupby` object and applies the `count()` method to it, but the grouped version of the data is not saved. The `groupby` object can be assigned to a name and then used in multiple operations. Think of it as a view of the data, like a SQL query.

In [None]:
# Example: create the groupby object for further use
grouped_gender = df2.groupby('Gender')

# print only the gender = 0 group
print(grouped_gender.get_group(0))

The `aggregate()` method is a flexible way to specify some of these calculations on a `groupby` object. It can accept a string, function, or list of either and compute all of those aggregates at once. 

In [None]:
# Example: use aggregate() on a groupby object
# Select numeric columns first, then pass the aggregations by name
grouped_gender[['Age', 'Income']].aggregate(['min', 'median', 'max'])

## Missing data
In almost every dataset you may have the opportunity to work with (outside of a classroom anyway) there will be missing data. It can be missing for any number of reasons. It may simply be that a survey respondent did not answer that question. Or it could be expected due to the design of the survey.

In our Taiwan cell phone example data a value may be missing because the question on the survey was optional. Consider this set of questions: 
```
1. Are you currently employed? Yes or No
2. If you answered "Yes" to question 1, do you work full-time or part-time? Full-time or Part-time
```

People who are not employed will not answer the second question so there will be an expected missing value there. In fact, if you really want to get into the data, you would want to make sure there were no answers on #2 for anyone who answered "No" for #1.
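That consistency check can be sketched with two Boolean filters. The frame below is made up, and the coding (`Employed` 1 = Yes, 2 = No) is an assumption for this illustration, so confirm it against the survey codebook before applying the idea to the real data.

```python
import numpy as np
import pandas as pd

# Made-up responses; assumed coding for this sketch: Employed 1 = Yes, 2 = No
survey = pd.DataFrame({
    'SurveyID': ['S1', 'S2', 'S3', 'S4'],
    'Employed': [1.0, 2.0, 2.0, 1.0],
    'FullPart': [1.0, np.nan, 2.0, np.nan],
})

# Answered "No" to question 1 but still answered question 2
unexpected = survey[(survey['Employed'] == 2) & survey['FullPart'].notnull()]

# Answered "Yes" to question 1 but skipped question 2
skipped = survey[(survey['Employed'] == 1) & survey['FullPart'].isnull()]

print(unexpected['SurveyID'].tolist())
print(skipped['SurveyID'].tolist())
```

The first group contains answers that should not exist; the second contains missing values that are not explained by the survey design.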

Whatever the reason you can expect to see missing data. But that doesn't mean you need to immediately throw out any responses with missing values. If it is expected, you can use that data for a specialized purpose. If it is not expected, there are other ways to work with it. 

We have already seen some indications of missing data in the Taiwan dataset. You should have an idea of how you will deal with them before you begin your analysis. There are really three things you must address when it comes to handling missing values in your dataset.
1. You have to find the missing values.
2. You have to code them consistently (give them all the same representation).
3. You have to decide what to do with them.




### Finding missing values
To find missing values we look for null values. If we use the `isnull()` method, we get a table of True or False answers to the question "does this cell contain a null value?" It would be more useful to check individual fields for nulls since we know many of the fields will have null values by design.

In [None]:
# Example: finding null values
#print(df2.isna())
#print(df2.isnull())
#print(df2.count(axis = 0))     #gives us a count of all the non-null values
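A full table of True and False values is hard to read, so it often helps to summarize it. The sketch below, on a small made-up frame, counts nulls per column and then pulls out the rows where one specific field is missing.

```python
import numpy as np
import pandas as pd

# Made-up survey rows for illustration
survey = pd.DataFrame({
    'Age': [60.0, np.nan, 71.0],
    'Income': [1.0, 2.0, np.nan],
    'Comments': [None, None, 'ok'],
})

# isnull() gives True/False per cell; summing counts the True values per column
null_counts = survey.isnull().sum()
print(null_counts)

# Rows where one specific field is missing
missing_age = survey[survey['Age'].isnull()]
print(missing_age)
```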


### Handling missingness
We've already seen the `NaN` value in some of the views of our dataset. That is the pandas default way of representing a missing (or null) value. The Python default is `None`. Some people like to use a value that is valid but couldn't be mistaken for an actual data point, like `-999`. In any case, a consistent scheme for encoding missingness should be adopted to avoid potential problems.
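If a source uses a sentinel such as `-999`, one way to make the coding consistent is to convert the sentinel to `NaN` before any calculations. The values below are made up; the point is that leaving `-999` in place would distort the mean.

```python
import numpy as np
import pandas as pd

# Made-up ages where -999 and None both stand for "missing"
ages = pd.Series([60.0, -999, 71.0, None])

# Recode the sentinel so every missing value is represented as NaN
ages_clean = ages.replace(-999, np.nan)

print(ages_clean.isnull().sum())
print(ages.mean(), ages_clean.mean())
```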

Once we know there are missing values in our data and we have them coded consistently, we have to decide what to do about them. There are three possibilities:
1. ignore the missing data
2. drop the rows with missing data from the dataset
3. fill in the missing data points

Ignoring the problem (as with most problems) may lead to unanticipated consequences for your analysis, so we seldom do that. We can use the `dropna()` method to remove the rows with missing data from the dataframe, but that also can be a problem. Most of the time we will fill in the missing data.
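To see why dropping rows can be a problem, here is a small sketch with made-up data in which one column is missing by design. The names `demo`, `Age`, and `FullPart` are used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; FullPart is missing by design for some rows
demo = pd.DataFrame({
    'Age': [60, np.nan, 62, 58],
    'FullPart': [np.nan, np.nan, 1, 2],
})

dropped_any = demo.dropna()                  # drops every row that has any null
dropped_age = demo.dropna(subset=['Age'])    # drops only rows that are missing Age

print(len(demo), len(dropped_any), len(dropped_age))
```

Dropping on any null removes half of this tiny dataset, including a row whose only gap is an expected one. The `subset` argument limits the check to the fields that actually matter for the analysis.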

A discussion of all the statistical tools available to replace null values in our data is well beyond the scope of our class. However, one simple method (that might be appropriate in a simple dataset) is to replace missing values with the mean of the other values. The mean and median only work for numeric fields, but the mode (the most frequent value) can be used in both numeric and text fields.

In [25]:
# Example: display the mean, median, and mode for the Age column
#print('Mean:   ', df2['Age'].mean())
print('Mean:   ', round(df2['Age'].mean(), 1))
print('Median: ', df2['Age'].median())
print('Mode:   ', df2['Age'].mode())

Mean:    63.5
Median:  62.0
Mode:    0    60.0
dtype: float64


In [37]:
# Example: replace missing values with mean, median or mode
fill_values = {'Age': round(df2['Age'].mean(), 1)}
#fill_values = {'Age': df2['Age'].median()}
#fill_values = {'Age': df2['Age'].mode()[0]}   # mode() returns a Series, so take its first value
                            
print(fill_values)
df2.fillna(value = fill_values)

{'Age': 63.5}


Unnamed: 0,SurveyID,Age,Gender_RAW,Type,Education,Income,Employed,FullPart,@1OwnCell,@2UsedCell,...,PBI4,PBIUse,PBIBuy,PGL1,PGL2,PGL3,Comments,VAR00001,Gender,PAnxiety
0,S4,63.5,2,1,3,1.0,2.0,,,,...,1,6.3,2.1,6,4,6,,3,1,3.333333
1,S5,63.5,1,1,3,2.0,2.0,,10.00,10.00,...,2,4.9,2.8,4,3,4,,3,0,3.666667
2,S8,63.5,2,1,2,2.0,2.0,,25.00,25.00,...,1,4.2,2.8,5,5,5,,2,1,2.000000
3,S12,63.5,2,1,2,3.0,2.0,,12.00,12.00,...,1,5.6,1.4,5,5,5,,2,1,2.000000
4,S13,63.5,2,1,3,,2.0,,10.00,10.00,...,0,0.0,0.0,4,4,4,,3,1,0.000000
5,S14,63.5,2,1,3,5.0,1.0,1.0,10.00,10.00,...,2,6.3,3.5,9,8,8,,3,1,0.666667
6,S15,63.5,2,1,2,3.0,1.0,2.0,,,...,3,4.9,4.2,7,7,7,,2,1,2.333333
7,S16,63.5,2,1,3,,2.0,,10.00,10.00,...,2,4.9,3.5,4,4,4,,3,1,2.333333
8,S35,63.5,1,1,4,5.0,1.0,1.0,10.25,10.25,...,4,5.6,5.6,8,8,7,,4,0,1.000000
9,S37,63.5,1,1,4,5.0,1.0,1.0,8.00,8.00,...,1,5.6,2.1,5,5,4,,4,0,4.000000


In [None]:
# Example: look at the data frame header to see the table itself hasn't changed.
# Need to add the inplace option to change the underlying data.
df2.head()

In [36]:
# Example: replace missing values inplace with our choice of mean, median or mode
# Which version of fill_values shall we use?
fill_values = {'Age': round(df2['Age'].mean(), 1)}
df2.fillna(value = fill_values, inplace = True)
df2.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)


Unnamed: 0,SurveyID,Age,Gender_RAW,Type,Education,Income,Employed,FullPart,@1OwnCell,@2UsedCell,...,PBI4,PBIUse,PBIBuy,PGL1,PGL2,PGL3,Comments,VAR00001,Gender,PAnxiety
0,S4,63.5,2,1,3,1.0,2.0,,,,...,1,6.3,2.1,6,4,6,,3,1,3.333333
1,S5,63.5,1,1,3,2.0,2.0,,10.0,10.0,...,2,4.9,2.8,4,3,4,,3,0,3.666667
2,S8,63.5,2,1,2,2.0,2.0,,25.0,25.0,...,1,4.2,2.8,5,5,5,,2,1,2.0
3,S12,63.5,2,1,2,3.0,2.0,,12.0,12.0,...,1,5.6,1.4,5,5,5,,2,1,2.0
4,S13,63.5,2,1,3,,2.0,,10.0,10.0,...,0,0.0,0.0,4,4,4,,3,1,0.0
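The warning in the output above usually means the DataFrame was created as a slice of another DataFrame. We are assuming that is how `df2` was built, since its creation is not shown here. One common way to avoid the warning is to take an explicit copy when creating the subset and to assign the filled column back instead of using `inplace`. The names `full` and `subset` below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration
full = pd.DataFrame({
    'Age': [60.0, np.nan, 67.0, np.nan],
    'Type': [1, 1, 2, 1],
})

# .copy() makes the subset an independent DataFrame rather than a slice
subset = full[full['Type'] == 1].copy()

# Assign the filled column back instead of using inplace=True
subset['Age'] = subset['Age'].fillna(round(subset['Age'].mean(), 1))

print(subset)
print(full)      # the original DataFrame still has its missing values
```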


## Feature creation
We've already seen that sometimes the original data isn't in the form that best helps us conduct the analysis we want. That can include some of the issues we've seen so far, like numeric variables that need to be categorical, or it could be that the variable we really want has to be created from some of the data. The variables (i.e., columns) we use for our analysis are often referred to as _features_. 

The features that might need to be created will depend upon the data at hand and the analysis goals, but there are some general kinds of operations you might need to do to create the features you need. They include situations like these:
- Daily date data is available but you really need to extract year values or to indicate the quarter.
- Continuous numeric data like income or sales might benefit from _binning_, which is a conversion of continuous data into categories (or bins).
- The raw data contains the components of the feature you need. For example, you may have dividend per share and earnings per share data for stocks and you need the dividend payout ratio for the analysis.
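The three situations above can be sketched with a few lines of pandas. The DataFrame, its column names, and the bin boundaries below are all made up for illustration and are not part of our survey data.

```python
import pandas as pd

# Hypothetical toy data for illustration
demo = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-15', '2019-05-20', '2019-11-03']),
    'income': [25000, 58000, 120000],
    'dividend_per_share': [0.50, 1.20, 0.00],
    'earnings_per_share': [2.00, 3.00, 1.50],
})

# 1. Extract the year and quarter from a daily date
demo['year'] = demo['date'].dt.year
demo['quarter'] = demo['date'].dt.quarter

# 2. Bin a continuous variable into categories (boundaries are arbitrary here)
demo['income_band'] = pd.cut(
    demo['income'],
    bins=[0, 40000, 80000, float('inf')],
    labels=['low', 'middle', 'high'],
)

# 3. Build a feature from its components
demo['payout_ratio'] = demo['dividend_per_share'] / demo['earnings_per_share']

print(demo)
```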

The last situation in the list, building a feature from components already in the raw data, is the one we will work on further. First, let's review what the df2 DataFrame currently looks like.

In [None]:
# Example: review the columns available before adding a calculated column
df2.info()

We know from a description of our data that PA1, PA2, and PA3 are all indicators of a construct we are studying. We also know that we want to combine them into one measure called 'perceived anxiety' for further analysis. One way we can do that is by creating a new column containing the average of the component values. We can do it using either the `average()` function built into NumPy or we can manually code the calculation.

Of course the creation of a new feature may involve more (or less) complex calculations than just an average. Use whatever methods are appropriate to the situation.

In [28]:
# Example: create PAnxiety from PA1, PA2, and PA3
# Do it with built-in function
# The axis=0 argument averages across the three stacked columns, giving one value per row
df2['PAnxiety'] = np.average((df2['PA1'], df2['PA2'], df2['PA3']), axis = 0)
# Do it manually
#df2['PAnxiety'] = round((df2['PA1'] + df2['PA2'] + df2['PA3'])/3, 2)

# Verify the addition of the new column
df2.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


Unnamed: 0,SurveyID,Age,Gender_RAW,Type,Education,Income,Employed,FullPart,@1OwnCell,@2UsedCell,...,PBI4,PBIUse,PBIBuy,PGL1,PGL2,PGL3,Comments,VAR00001,Gender,PAnxiety
0,S4,,2,1,3,1.0,2.0,,,,...,1,6.3,2.1,6,4,6,,3,1,3.333333
1,S5,,1,1,3,2.0,2.0,,10.0,10.0,...,2,4.9,2.8,4,3,4,,3,0,3.666667
2,S8,,2,1,2,2.0,2.0,,25.0,25.0,...,1,4.2,2.8,5,5,5,,2,1,2.0
3,S12,,2,1,2,3.0,2.0,,12.0,12.0,...,1,5.6,1.4,5,5,5,,2,1,2.0
4,S13,,2,1,3,,2.0,,10.0,10.0,...,0,0.0,0.0,4,4,4,,3,1,0.0


We should verify some of the values to make sure our calculation worked as expected. To make that a little easier we can print just the columns of interest.

In [None]:
# Example: verifying calculation of PAnxiety
print(df2[['PA1', 'PA2', 'PA3', 'PAnxiety']])
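Checking by eye works for a handful of rows. For a larger dataset, one option is to recompute the value a second way and compare the two results. The sketch below does that on a small made-up DataFrame (`demo` is a hypothetical stand-in for `df2`), using the pandas `mean()` method across columns as the second calculation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the survey columns
demo = pd.DataFrame({
    'PA1': [3, 4, 2],
    'PA2': [4, 3, 2],
    'PA3': [3, 4, 2],
})

# Same calculation as in the example above
demo['PAnxiety'] = np.average((demo['PA1'], demo['PA2'], demo['PA3']), axis=0)

# Second calculation: the row-wise mean of the three component columns
components = ['PA1', 'PA2', 'PA3']
row_means = demo[components].mean(axis=1)

# allclose() compares floating-point values within a small tolerance
matches = np.allclose(demo['PAnxiety'], row_means)
print(matches)
```

Keep in mind that `mean(axis=1)` skips missing values by default while `np.average()` does not, so the two can disagree on rows where one of the components is null.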