# Exploratory Data Analysis

- Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.
- A statistical model may or may not be used; the main purpose of EDA is to see what the data can tell us beyond formal modeling or hypothesis testing.
- Exploratory data analysis includes the following steps:
    * Specifying assumptions and detecting data errors
    * Identifying relations between variables of interest
    * Formulating a testable hypothesis 

## Assumptions and Errors
The first step is to understand what we can assume about the dataset that we have. Some of the assumptions are extrinsic:
* How was the data collected?
* What was the data collection process trying to achieve? 
* How might the data collection process be biased?

Some of the assumptions are intrinsic:
* Data types of columns
* Relationships between columns
* Any more .. ?

In [8]:
import pandas as pd
df = pd.read_csv('ssn.csv', delimiter=' ') # what does this do?

In [2]:
# what does this do?
pd.options.display.max_rows

60

In [3]:
pd.set_option("display.max_rows", 999)
pd.set_option("display.max_columns", 15)
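The display options above can also be read back or changed temporarily. The following is a minimal sketch, assuming only that pandas is installed; the value `5` is an arbitrary choice for illustration.

```python
import pandas as pd

# Read the current setting (60 in the session shown above, before it was changed)
default_rows = pd.get_option("display.max_rows")

# Change the option only inside the with-block; it is restored afterwards
with pd.option_context("display.max_rows", 5):
    inside = pd.get_option("display.max_rows")

after = pd.get_option("display.max_rows")
print(default_rows, inside, after)
```

`pd.set_option` changes the setting for the rest of the session, whereas `pd.option_context` is convenient when only one cell needs a different limit.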

In [9]:
df.head()

Unnamed: 0,Name,Gender,Social,Age,Acct
0,Arthur-Crawford,male,092-30-2294,47,3882717958320846
1,Anthony-Dove,male,729-04-3185,54,7141086558511739
2,Noemi-Teel,female,117-67-4531,59,13061500012440717
3,Kim-Lovell,female,551-01-1394,51,1716351605690022
4,Kenneth-Gallagher,male,322-79-2084,36,13699495317963956
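One of the intrinsic assumptions listed earlier is the data type of each column. The sketch below shows one way to check it, using the five rows printed above rebuilt as an in-memory, space-delimited text (an assumption about what `ssn.csv` looks like on disk; the name `sample` is introduced only for this example).

```python
import io
import pandas as pd

# The five rows from df.head(), written as space-delimited text
raw = """Name Gender Social Age Acct
Arthur-Crawford male 092-30-2294 47 3882717958320846
Anthony-Dove male 729-04-3185 54 7141086558511739
Noemi-Teel female 117-67-4531 59 13061500012440717
Kim-Lovell female 551-01-1394 51 1716351605690022
Kenneth-Gallagher male 322-79-2084 36 13699495317963956
"""
sample = pd.read_csv(io.StringIO(raw), delimiter=' ')

# Column data types as pandas inferred them
print(sample.dtypes)

# Turn the expectations into explicit checks
age_is_integer = pd.api.types.is_integer_dtype(sample['Age'])
social_is_text = not pd.api.types.is_numeric_dtype(sample['Social'])
print(age_is_integer, social_is_text)
```

If a column that should be numeric comes back as text, that is often a first hint of a malformed value somewhere in the file.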


**What kind of constraints do we have on this data?**
- We may know that age has to be a number between 0 and 122.

In [10]:
df[df['Age'] < 0] # but this row violates the constraint

Unnamed: 0,Name,Gender,Social,Age,Acct
24,Robert-Prater,male,672-85-4374,-1,6251379381664522


In [11]:
df[df['Age'] > 122] # ok: no rows exceed the upper bound

Unnamed: 0,Name,Gender,Social,Age,Acct
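The two checks above can be combined into a single range test. This is a minimal sketch on a small hand-built frame (the name `people` and the three-row selection are assumptions made for illustration), using `Series.between`, which is inclusive on both ends by default.

```python
import pandas as pd

# Small illustrative frame; the -1 row mirrors the error found above
people = pd.DataFrame({
    'Name': ['Arthur-Crawford', 'Robert-Prater', 'Kim-Lovell'],
    'Age': [47, -1, 51],
})

# True where the age lies in the allowed range [0, 122]
valid_age = people['Age'].between(0, 122)

# Negate the mask to list the rows that break the constraint
violations = people[~valid_age]
print(violations)
```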


**We also know that social security numbers must be unique**
-  How do we find uniqueness errors?

In [12]:
df.groupby('Social')['Social'].count()

Social
006-18-9627    1
008-99-9285    1
013-28-7669    1
020-89-8052    1
021-16-5299    1
046-79-1082    1
048-49-9559    1
071-31-8468    1
075-32-4227    1
078-56-2172    1
080-25-9480    1
081-61-9579    1
092-30-2294    2
094-20-6884    1
097-13-0267    1
106-67-9614    1
112-94-4837    1
114-47-3920    1
117-67-4531    1
150-20-8678    1
175-97-7421    1
182-70-5718    1
195-33-7520    1
213-60-5527    1
219-88-4029    1
236-81-1635    1
237-30-6965    1
242-08-0445    1
253-40-4221    1
275-17-1922    1
284-63-1932    1
317-55-7155    1
320-11-5940    1
322-79-2084    1
323-92-6940    1
330-14-7912    1
331-02-9899    1
332-35-9305    1
338-85-2951    1
352-30-5893    1
361-16-4912    1
371-49-4601    1
390-36-8012    1
421-42-1071    1
432-07-4044    1
448-35-0031    1
474-53-5002    1
488-12-8847    1
494-41-9206    1
514-23-7501    1
528-40-1365    1
529-02-1521    1
534-24-4869    1
545-96-3499    1
551-01-1394    1
554-02-9878    1
594-16-5074    1
594-64-2607    1
596-36-

**what does this do?**

In [14]:
df['Social'].value_counts() 

092-30-2294    2
097-13-0267    1
844-18-9887    1
021-16-5299    1
080-25-9480    1
213-60-5527    1
488-12-8847    1
938-00-4773    1
603-95-5228    1
545-96-3499    1
727-24-9980    1
284-63-1932    1
323-92-6940    1
596-36-3305    1
114-47-3920    1
048-49-9559    1
554-02-9878    1
791-25-6633    1
534-24-4869    1
771-98-4161    1
594-16-5074    1
801-08-4437    1
950-04-7562    1
078-56-2172    1
722-55-6735    1
046-79-1082    1
242-08-0445    1
918-30-2569    1
693-63-9152    1
846-57-2998    1
275-17-1922    1
081-61-9579    1
905-85-5564    1
966-26-8444    1
317-55-7155    1
236-81-1635    1
529-02-1521    1
594-64-2607    1
742-21-1327    1
195-33-7520    1
237-30-6965    1
514-23-7501    1
612-63-5099    1
071-31-8468    1
885-32-1886    1
361-16-4912    1
106-67-9614    1
331-02-9899    1
219-88-4029    1
713-01-1423    1
729-04-3185    1
767-25-8914    1
922-22-4679    1
912-78-9636    1
112-94-4837    1
332-35-9305    1
895-52-4350    1
619-57-0193    1
842-76-9108   

In [13]:
social_count = df.groupby('Social')['Social'].count()
social_count[social_count > 1]

Social
092-30-2294    2
Name: Social, dtype: int64

In [16]:
social_count[social_count > 1].index

Index(['092-30-2294'], dtype='object', name='Social')

In [18]:
for v in social_count[social_count > 1].index:
    print(v)

092-30-2294


In [14]:
for v in social_count[social_count > 1].index:
    print(v)
    print(df[df['Social'] == v])

092-30-2294
               Name Gender       Social  Age               Acct
0   Arthur-Crawford   male  092-30-2294   47   3882717958320846
19    Curtis-Martin   male  092-30-2294   50  10957389913694997


**even better**

In [21]:
t = '092-30-2294'
df[df['Social'] == t]

Unnamed: 0,Name,Gender,Social,Age,Acct
0,Arthur-Crawford,male,092-30-2294,47,3882717958320846
19,Curtis-Martin,male,092-30-2294,50,10957389913694997
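As an alternative to grouping and counting, pandas can flag repeated values directly. The sketch below assumes a three-row excerpt of the data (the name `records` is introduced for this example) and uses `Series.duplicated(keep=False)`, which marks every member of a duplicated group rather than only the later occurrences.

```python
import pandas as pd

# Excerpt containing the repeated social security number found above
records = pd.DataFrame({
    'Name': ['Arthur-Crawford', 'Anthony-Dove', 'Curtis-Martin'],
    'Social': ['092-30-2294', '729-04-3185', '092-30-2294'],
})

# Quick yes/no answer to "is this column unique?"
print(records['Social'].is_unique)

# keep=False marks all rows that share a value with another row
dupes = records[records['Social'].duplicated(keep=False)]
print(dupes)
```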


## Pandas Merge

Let's spend some time understanding how the merge function works in pandas. Pandas merge is equivalent to an `equality join` (equi-join) operation, for those of you who have taken a database course. Let's think of a simple example:

In [15]:
table1 = [{'name': 'John Doe', 'category': 'A'}, 
          {'name': 'Jane Smith', 'category': 'B'}, 
          {'name': 'Alex Taylor', 'category': 'A'},
          {'name': 'Brett Daniels', 'category': 'C'}]

table1_df = pd.DataFrame(table1)
table1_df

Unnamed: 0,name,category
0,John Doe,A
1,Jane Smith,B
2,Alex Taylor,A
3,Brett Daniels,C


In [24]:
table2 = [{'salary': 1000, 'category': 'A'}, 
          {'salary': 900, 'category': 'B'}, 
          {'salary': 500, 'category': 'C'}]
table2_df = pd.DataFrame(table2)
table2_df

Unnamed: 0,salary,category
0,1000,A
1,900,B
2,500,C


In [25]:
table1_df.merge(table2_df, on='category')

Unnamed: 0,name,category,salary
0,John Doe,A,1000
1,Alex Taylor,A,1000
2,Jane Smith,B,900
3,Brett Daniels,C,500


In [27]:
merged = pd.merge(table1_df, table2_df) # equivalent call; with no 'on', merge joins on the columns both tables share
merged

Unnamed: 0,name,category,salary
0,John Doe,A,1000
1,Alex Taylor,A,1000
2,Jane Smith,B,900
3,Brett Daniels,C,500


This function is commutative, up to the order of the columns:

In [28]:
table2_df.merge(table1_df)

Unnamed: 0,salary,category,name
0,1000,A,John Doe
1,1000,A,Alex Taylor
2,900,B,Jane Smith
3,500,C,Brett Daniels
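One way to check that claim is to compare the two results after putting their columns and rows into a common order. This is a minimal sketch that rebuilds the two example tables; the names `left_first`, `right_first`, and `same_rows` are introduced only for this example.

```python
import pandas as pd

table1_df = pd.DataFrame([{'name': 'John Doe', 'category': 'A'},
                          {'name': 'Jane Smith', 'category': 'B'},
                          {'name': 'Alex Taylor', 'category': 'A'},
                          {'name': 'Brett Daniels', 'category': 'C'}])
table2_df = pd.DataFrame([{'salary': 1000, 'category': 'A'},
                          {'salary': 900, 'category': 'B'},
                          {'salary': 500, 'category': 'C'}])

left_first = table1_df.merge(table2_df)
right_first = table2_df.merge(table1_df)

# Normalize column order and row order before comparing
cols = sorted(left_first.columns)
a = left_first[cols].sort_values(cols).reset_index(drop=True)
b = right_first[cols].sort_values(cols).reset_index(drop=True)

same_rows = a.equals(b)
print(same_rows)
```

The raw results differ in column order, so comparing them directly with `equals` would report a difference even though they hold the same rows.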


This function merges the two tables on the `category` column and automatically removes the redundancy: only a single `category` column is left. The behavior of this function can be subtle. Suppose we change the category field to D for one of the rows:

In [31]:
table1 = [{'name': 'John Doe', 'category': 'A'}, 
          {'name': 'Jane Smith', 'category': 'B'}, 
          {'name': 'Alex Taylor', 'category': 'D'},
          {'name': 'Brett Daniels', 'category': 'C'}]

table1_df = pd.DataFrame(table1)
table1_df

Unnamed: 0,name,category
0,John Doe,A
1,Jane Smith,B
2,Alex Taylor,D
3,Brett Daniels,C


In [32]:
table1_df.merge(table2_df)

Unnamed: 0,name,category,salary
0,John Doe,A,1000
1,Jane Smith,B,900
2,Brett Daniels,C,500


The row gets dropped from the result! In its default mode (`how='inner'`), the merge command drops any row that doesn't have a match. The keyword `how` modifies this behavior. Suppose we also want the left rows that don't match:

In [35]:
table1_df.merge(table2_df, on='category', how='left')

Unnamed: 0,name,category,salary
0,John Doe,A,1000.0
1,Jane Smith,B,900.0
2,Alex Taylor,D,
3,Brett Daniels,C,500.0


It returns those rows, with any additional columns set to null or NaN depending on the data type. If you set `how` to `right`, you'll get the same answer as before (why?)

In [36]:
table1_df.merge(table2_df, on='category', how='right')

Unnamed: 0,name,category,salary
0,John Doe,A,1000
1,Jane Smith,B,900
2,Brett Daniels,C,500
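To see which rows fail to match rather than silently losing them, `merge` accepts `indicator=True`, which adds a `_merge` column reporting where each row came from. The sketch below rebuilds the modified tables and uses `how='outer'` so unmatched rows from either side are kept; the names `audit` and `unmatched` are introduced for this example.

```python
import pandas as pd

table1_df = pd.DataFrame([{'name': 'John Doe', 'category': 'A'},
                          {'name': 'Jane Smith', 'category': 'B'},
                          {'name': 'Alex Taylor', 'category': 'D'},
                          {'name': 'Brett Daniels', 'category': 'C'}])
table2_df = pd.DataFrame([{'salary': 1000, 'category': 'A'},
                          {'salary': 900, 'category': 'B'},
                          {'salary': 500, 'category': 'C'}])

# Keep every row from both sides and record its origin in '_merge'
audit = table1_df.merge(table2_df, on='category', how='outer', indicator=True)

# '_merge' is 'both', 'left_only', or 'right_only'
unmatched = audit[audit['_merge'] != 'both']
print(unmatched)
```

Here only the category D row is unmatched, and it is marked `left_only` because no row in the salary table has that category.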


## Checking Complex Conditions

How would you use a Pandas merge to generate an all pairs dataset?

In [21]:
def all_pairs(df):
    new_df = df.copy() # make a copy of the data frame
    new_df['dummy'] = 1
    
    return new_df.merge(new_df, on='dummy')

all_pairs(table1_df)

Unnamed: 0,name_x,category_x,dummy,name_y,category_y
0,John Doe,A,1,John Doe,A
1,John Doe,A,1,Jane Smith,B
2,John Doe,A,1,Alex Taylor,D
3,John Doe,A,1,Brett Daniels,C
4,Jane Smith,B,1,John Doe,A
5,Jane Smith,B,1,Jane Smith,B
6,Jane Smith,B,1,Alex Taylor,D
7,Jane Smith,B,1,Brett Daniels,C
8,Alex Taylor,D,1,John Doe,A
9,Alex Taylor,D,1,Jane Smith,B
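The dummy-column trick works on any pandas version. As an alternative, and assuming pandas 1.2 or newer, `merge` also supports `how='cross'`, which produces the same all-pairs result without the helper column. A minimal sketch:

```python
import pandas as pd

table1_df = pd.DataFrame([{'name': 'John Doe', 'category': 'A'},
                          {'name': 'Jane Smith', 'category': 'B'},
                          {'name': 'Alex Taylor', 'category': 'D'},
                          {'name': 'Brett Daniels', 'category': 'C'}])

# Cartesian product of the table with itself; no join key is needed
pairs = table1_df.merge(table1_df, how='cross')

# 4 rows paired with 4 rows gives 16 combinations
print(len(pairs))
print(list(pairs.columns))
```

Either way the result grows quadratically with the number of rows, so this approach is only practical for small tables.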


Let's go back to our original dataset:

In [22]:
pair_df = all_pairs(df)
pair_df

Unnamed: 0,Name_x,Gender_x,Social_x,Age_x,Acct_x,dummy,Name_y,Gender_y,Social_y,Age_y,Acct_y
0,Arthur-Crawford,male,092-30-2294,47,3882717958320846,1,Arthur-Crawford,male,092-30-2294,47,3882717958320846
1,Arthur-Crawford,male,092-30-2294,47,3882717958320846,1,Anthony-Dove,male,729-04-3185,54,7141086558511739
2,Arthur-Crawford,male,092-30-2294,47,3882717958320846,1,Noemi-Teel,female,117-67-4531,59,13061500012440717
3,Arthur-Crawford,male,092-30-2294,47,3882717958320846,1,Kim-Lovell,female,551-01-1394,51,1716351605690022
4,Arthur-Crawford,male,092-30-2294,47,3882717958320846,1,Kenneth-Gallagher,male,322-79-2084,36,13699495317963956
...,...,...,...,...,...,...,...,...,...,...,...
9995,Jane-Williamson,female,432-07-4044,24,488972166555225,1,Ivory-Curry,male,275-17-1922,47,19231534736287234
9996,Jane-Williamson,female,432-07-4044,24,488972166555225,1,Victor-Armstrong,male,846-57-2998,53,10498060972555052
9997,Jane-Williamson,female,432-07-4044,24,488972166555225,1,Annie-Mayweather,female,693-63-9152,38,5563786638578513
9998,Jane-Williamson,female,432-07-4044,24,488972166555225,1,Tracy-Zampieri,female,918-30-2569,44,1142720749472163


In [79]:
pair_df[(pair_df['Social_x'] == pair_df['Social_y']) & (pair_df['Acct_x'] != pair_df['Acct_y'])]

Unnamed: 0,Name_x,Gender_x,Social_x,Age_x,Acct_x,dummy,Name_y,Gender_y,Social_y,Age_y,Acct_y
19,Arthur-Crawford,male,092-30-2294,47,3882717958320846,1,Curtis-Martin,male,092-30-2294,50,10957389913694997
1900,Curtis-Martin,male,092-30-2294,50,10957389913694997,1,Arthur-Crawford,male,092-30-2294,47,3882717958320846
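Each conflicting pair shows up twice in the result, once in each order. If one row per pair is enough, a possible variation is to replace `!=` with `<` on the account numbers, which keeps only one ordering. The sketch below uses a three-row excerpt; the names `accounts`, `pairs`, and `conflicts` are introduced for this example.

```python
import pandas as pd

# Excerpt containing the two records that share a social security number
accounts = pd.DataFrame({
    'Name': ['Arthur-Crawford', 'Anthony-Dove', 'Curtis-Martin'],
    'Social': ['092-30-2294', '729-04-3185', '092-30-2294'],
    'Acct': [3882717958320846, 7141086558511739, 10957389913694997],
})

# Same dummy-key trick as all_pairs above
accounts['dummy'] = 1
pairs = accounts.merge(accounts, on='dummy')

# '<' instead of '!=' keeps each unordered pair exactly once
conflicts = pairs[(pairs['Social_x'] == pairs['Social_y']) &
                  (pairs['Acct_x'] < pairs['Acct_y'])]
print(conflicts[['Name_x', 'Name_y', 'Social_x']])
```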


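Why did we use the single (bitwise) `&`? Because Python's `and` tries to reduce each boolean mask to a single True/False value, which is ambiguous for a whole Series and raises an error; `&` combines the masks element by element. The same approach works for more complicated examples as well: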

In [23]:
emps = [{'name': 'John Doe', 'rank': 'Manager', 'salary': 100},
        {'name': 'Jane Smith', 'rank': 'Manager', 'salary': 55},
        {'name': 'Alex Taylor', 'rank': 'Employee', 'salary': 32},
        {'name': 'Brett Daniels', 'rank': 'Employee', 'salary': 57}]
emps_df = pd.DataFrame(emps)
emps_df

Unnamed: 0,name,rank,salary
0,John Doe,Manager,100
1,Jane Smith,Manager,55
2,Alex Taylor,Employee,32
3,Brett Daniels,Employee,57


Suppose we have the integrity constraint: no manager can earn less than an employee.

In [24]:
pair_df = all_pairs(emps_df)
pair_df[(pair_df['rank_x'] == 'Manager') & (pair_df['rank_y'] == 'Employee') & (pair_df['salary_x'] < pair_df['salary_y'])]

Unnamed: 0,name_x,rank_x,salary_x,dummy,name_y,rank_y,salary_y
7,Jane Smith,Manager,55,1,Brett Daniels,Employee,57
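Both filters in this section join boolean masks with `&` rather than `and`. The sketch below illustrates the difference on two small masks derived from the employee table (the names `is_manager` and `earns_over_50` are introduced for this example).

```python
import pandas as pd

# Masks for the four employees above: rank == 'Manager' and salary > 50
is_manager = pd.Series([True, True, False, False])
earns_over_50 = pd.Series([True, True, False, True])

# '&' combines the masks element by element
combined = is_manager & earns_over_50
print(combined.tolist())

# 'and' asks for a single truth value per Series, which pandas refuses to guess
try:
    is_manager and earns_over_50
    and_failed = False
except ValueError as err:
    and_failed = True
    print(err)
```

The parentheses around each comparison in the filters matter as well: `&` binds more tightly than `==` and `<`, so without them the expression would be grouped differently.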
