# Data Science and Visualization (RUC F2023)

## Lecture 2: Exploratory Data Analysis (EDA)

# Data preprocessing and EDA

This notebook demonstrates how to preprocess data and conduct EDA.

## 1. Load the dataset and look at its basic statistics

We still use the Fortune 500 dataset but this time we've copied it to a local folder.

In [1]:
import pandas as pd
import os

os.chdir('C:/Data')
# os.getcwd()

# I've stored the .csv in C:/Data/
data = pd.read_csv('fortune500.csv')
data.head()

Unnamed: 0,Year,Rank,Company,Revenue (in millions),Profit (in millions)
0,1955,1,General Motors,9823.5,806.0
1,1955,2,Exxon Mobil,5661.4,584.8
2,1955,3,U.S. Steel,3250.4,195.4
3,1955,4,General Electric,2959.1,212.6
4,1955,5,Esmark,2510.8,19.1
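
The `os.chdir` call ties the notebook to one machine's folder layout. As a minimal sketch, `pd.read_csv` can instead be given a path or a file-like object directly. The example below uses an in-memory stand-in built from the five rows displayed above, so it runs without the real file; `DATA_DIR` in the comment is a hypothetical name, not something defined in this notebook.

```python
import io
import pandas as pd

# Stand-in for fortune500.csv: only the five rows displayed above.
csv_text = """Year,Rank,Company,Revenue (in millions),Profit (in millions)
1955,1,General Motors,9823.5,806.0
1955,2,Exxon Mobil,5661.4,584.8
1955,3,U.S. Steel,3250.4,195.4
1955,4,General Electric,2959.1,212.6
1955,5,Esmark,2510.8,19.1
"""

# read_csv accepts a path or any file-like object, so no os.chdir is needed.
# With the real file, pd.read_csv(DATA_DIR / 'fortune500.csv') would work,
# where DATA_DIR is a hypothetical pathlib.Path pointing at your data folder.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)
print(sample.head())
```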


In [2]:
data.shape

(25500, 5)

The column names of a DataFrame are stored in a list-like object (specifically, a pandas.core.indexes.base.Index), not in a plain Python list.

In [3]:
data.columns

Index(['Year', 'Rank', 'Company', 'Revenue (in millions)',
       'Profit (in millions)'],
      dtype='object')
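
To see how far the list analogy goes for an `Index`, here is a small sketch on a toy DataFrame (the frame `toy` is made up for illustration and is not part of the dataset).

```python
import pandas as pd

toy = pd.DataFrame({'Year': [1955], 'Rank': [1], 'Company': ['General Motors']})
cols = toy.columns

print(type(cols).__name__)   # class name of the columns object
print(cols[0])               # positional access works as with a list
print('Rank' in cols)        # membership test on the labels
print(cols.tolist())         # convert to a plain Python list when needed
```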

Some of the column names are long, and the spaces inside them rule out attribute-style access such as data.profit, so let's rename them.

In [4]:
data.columns = ['year', 'rank', 'company', 'revenue', 'profit']
data.columns

Index(['year', 'rank', 'company', 'revenue', 'profit'], dtype='object')
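
Assigning a full list to `data.columns` relies on the columns being in the expected order. A hedged alternative, sketched here on a toy frame, is `DataFrame.rename` with an explicit old-to-new mapping, which only touches the columns named in the mapping and returns a new DataFrame.

```python
import pandas as pd

toy = pd.DataFrame({
    'Year': [1955],
    'Revenue (in millions)': [9823.5],
    'Profit (in millions)': [806.0],
})

# Map old names to new ones; columns not listed would keep their names.
renamed = toy.rename(columns={
    'Year': 'year',
    'Revenue (in millions)': 'revenue',
    'Profit (in millions)': 'profit',
})
print(renamed.columns.tolist())
```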

We can call the info() method to print a concise summary of a DataFrame object, including each column's non-null count and dtype.

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25500 entries, 0 to 25499
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   year     25500 non-null  int64  
 1   rank     25500 non-null  int64  
 2   company  25500 non-null  object 
 3   revenue  25500 non-null  float64
 4   profit   25500 non-null  object 
dtypes: float64(1), int64(2), object(2)
memory usage: 996.2+ KB


In [8]:
data.isnull().sum()

year       0
rank       0
company    0
revenue    0
profit     0
dtype: int64

**Question**: What are the top-10 companies that appear in the Fortune 500 most frequently?

* **value_counts()** counts the occurrences of each distinct value in a given dimension.

In [12]:
data.company.value_counts().nlargest(10)

CBS               57
OfficeMax         55
General Motors    51
Anheuser-Busch    51
Ashland           51
McGraw-Hill       51
General Mills     51
H.J. Heinz        51
PepsiCo           51
USG               51
Name: company, dtype: int64
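
As a small sketch of what `value_counts()` returns, the toy Series below (made-up values) shows that the result is already sorted by count in descending order, so taking the first few entries gives the most frequent values.

```python
import pandas as pd

companies = pd.Series(['CBS', 'USG', 'CBS', 'PepsiCo', 'CBS', 'USG'])

counts = companies.value_counts()   # one entry per distinct value, sorted by count
print(counts.head(2))               # the two most frequent values
```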

## 2. Deal with 'missing' values for 'profit' dimension

The 'profit' dimension should be *numeric*, so why is its type *object*? Let's get a description of it.

In [13]:
(data['profit']).describe()

count     25500
unique     6977
top        N.A.
freq        369
Name: profit, dtype: object

There are 369 rows in which the 'profit' value is the string 'N.A.'. Note that this is not a *standard* missing value (NaN), so pandas does not treat it as missing. We can call isnull() to confirm this.

In [14]:
data['profit'].isnull().sum()

0

We can show all those N.A. values under the 'profit' column:

In [8]:
pd.set_option('display.max_rows', None)
data['profit'][data['profit']=='N.A.']

228      N.A.
290      N.A.
294      N.A.
296      N.A.
352      N.A.
464      N.A.
712      N.A.
857      N.A.
1225     N.A.
1372     N.A.
1419     N.A.
1717     N.A.
1738     N.A.
1900     N.A.
2264     N.A.
2456     N.A.
2491     N.A.
2754     N.A.
3253     N.A.
3442     N.A.
3754     N.A.
3998     N.A.
5485     N.A.
8032     N.A.
8183     N.A.
8191     N.A.
8193     N.A.
8296     N.A.
8313     N.A.
8481     N.A.
8646     N.A.
8672     N.A.
8688     N.A.
8689     N.A.
8788     N.A.
8817     N.A.
8931     N.A.
8959     N.A.
9132     N.A.
9184     N.A.
9187     N.A.
9195     N.A.
9279     N.A.
9330     N.A.
9459     N.A.
9654     N.A.
9681     N.A.
9685     N.A.
9697     N.A.
9765     N.A.
9872     N.A.
9942     N.A.
10140    N.A.
10161    N.A.
10182    N.A.
10192    N.A.
10261    N.A.
10434    N.A.
10634    N.A.
10640    N.A.
10653    N.A.
10679    N.A.
10738    N.A.
10858    N.A.
10937    N.A.
11122    N.A.
11136    N.A.
11161    N.A.
11182    N.A.
11250    N.A.
11353    N.A.
11459 

We can get the number of rows with 'N.A.' profit values in another way:

In [15]:
len(data['profit'][data['profit']=='N.A.'])

369

We can change the N.A. profit values to numeric values in a few steps.

### (1) N.A. value replacement on profit dimension

We need to change the type of the 'profit' column (a typecast) from str to float64. Before that, we have to replace every 'N.A.' with a value that can be converted; otherwise the conversion fails. The trick is to replace every 'N.A.' with np.nan, which is itself a float.

In [18]:
import numpy as np

data.loc[data["profit"] == "N.A.", "profit"] = np.nan

Now, check again how many NaN values there are in the 'profit' column:

In [19]:
data['profit'].isna().sum()

369
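
If the placeholder string is known before loading, another option is to declare it through the `na_values` parameter of `read_csv`, so that it is parsed as NaN from the start. The sketch below uses a made-up three-row CSV rather than the real file.

```python
import io
import pandas as pd

csv_text = """year,company,profit
1955,A,806.0
1955,B,N.A.
1956,C,19.1
"""

# Treat the string 'N.A.' as a missing value while parsing.
parsed = pd.read_csv(io.StringIO(csv_text), na_values=['N.A.'])
print(parsed['profit'].dtype)
print(parsed['profit'].isna().sum())
```

With the placeholder handled at load time, the column is numeric immediately and no separate replacement or typecast is needed.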

### (2) typecast on profit dimension

We call **astype()** on the profit dimension. Remember to assign the converted column back to data.profit, because astype() returns a new Series.

In [22]:
# Now we will convert it into 'float64' type.
data.profit = data.profit.astype('float64')
  
data['profit'].describe()

count    25131.000000
mean       207.903677
std       1173.695947
min     -98696.000000
25%          8.900000
50%         35.500000
75%        150.500000
max      25330.000000
Name: profit, dtype: float64
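
Steps (1) and (2) can also be combined. As a sketch on a toy Series, `pd.to_numeric` with `errors='coerce'` converts every value it can parse and turns the rest into NaN. It coerces any unparsable string, not only 'N.A.', so it is worth counting the resulting NaNs afterwards.

```python
import pandas as pd

raw = pd.Series(['806.0', 'N.A.', '-19.1'])   # made-up values for illustration

# Unparsable entries become NaN instead of raising an error.
profit = pd.to_numeric(raw, errors='coerce')
print(profit.dtype)
print(profit.isna().sum())
```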

### (3) filling for missing values

Now we replace the np.nan values (previously 'N.A.') with the *average* profit of the corresponding year. This requires combining **groupby** and **transform**.

Let's first get the average yearly 'profit':

In [24]:
data.groupby('year')['profit'].mean()

year
1955      16.732186
1956      22.509237
1957      23.181087
1958      23.456740
1959      19.279276
1960      24.019439
1961      23.340361
1962      23.326104
1963      26.965000
1964      29.675800
1965      34.543287
1966      40.023000
1967      44.155400
1968      42.801200
1969      48.387400
1970      49.354800
1971      43.952535
1972      47.585569
1973      56.448479
1974      78.461460
1975      88.345547
1976      76.774645
1977     100.138742
1978     107.128921
1979     125.347862
1980     158.932049
1981     165.021748
1982     171.926327
1983     125.104684
1984     140.670143
1985     176.473306
1986     145.980294
1987     134.419916
1988     188.693750
1989     241.518908
1990     219.716632
1991     194.674583
1992     114.938622
1993      -0.405165
1994     129.387810
1995     435.736640
1996     493.942105
1997     606.676815
1998     652.375050
1999     672.704628
2000     825.773387
2001     893.058753
2002     415.736694
2003     140.546263
2004     891.18

The following returns a Series aligned with the original rows, in which each entry, whether or not the original value was missing, is the mean profit of its year group. It does not modify data.

In [36]:
# In case we want to see what groupby followed by transform means
data.groupby('year')['profit'].transform('mean')

0          16.732186
1          16.732186
2          16.732186
3          16.732186
4          16.732186
            ...     
25495    1028.991182
25496    1028.991182
25497    1028.991182
25498    1028.991182
25499    1028.991182
Name: profit, Length: 25500, dtype: float64

However, we want to fill in the missing values only. So we pass the transformed Series to fillna(), which leaves existing values untouched:

In [25]:
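# Assign the result back: fillna(..., inplace=True) on a selected column may not update data in newer pandas versions
data['profit'] = data['profit'].fillna(data.groupby('year')['profit'].transform('mean'))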

In [26]:
data['profit'].describe()

count    25500.000000
mean       207.445384
std       1165.337309
min     -98696.000000
25%          9.000000
50%         36.500000
75%        153.525000
max      25330.000000
Name: profit, dtype: float64
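
To make the effect easy to check by hand, here is a minimal sketch on a toy frame with two years (made-up numbers). It assigns the result of `fillna` back to the column, which is a stylistic assumption on my part; newer pandas releases discourage `inplace=True` on a selected column.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'year':   [1955, 1955, 1955, 1956, 1956],
    'profit': [10.0, np.nan, 30.0, 5.0, np.nan],
})

# Per-row mean of the row's year group; NaN values are skipped by mean.
year_mean = toy.groupby('year')['profit'].transform('mean')

# fillna only touches the rows that are NaN; existing values are kept.
toy['profit'] = toy['profit'].fillna(year_mean)
print(toy)
```

The 1955 gap is filled with 20.0, the mean of 10.0 and 30.0, and the 1956 gap with 5.0, while the three original values are unchanged.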

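#### Summary

The same pattern applies to other datasets and statistics. In the line below, df_merged, continuous_variables, and domain_userid are placeholders for your own DataFrame, list of numeric columns, and grouping column:

* x = df_merged[continuous_variables].fillna(df_merged.groupby('domain_userid')[continuous_variables].transform('median'))

How do we do it?

* groupby: split the rows into groups
* transform: compute a statistic per group and broadcast it back to every row
    * 'mean': other options are 'min', 'max', 'median'
* fillna: fill only the missing entries with the broadcast values
    * the result must be assigned back unless inplace=True is used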

## A small example for better understanding groupby + transform

In [37]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two'],
                   'C' : [1, 5, 5, 2, 5, 5]})
df

Unnamed: 0,A,B,C
0,foo,one,1
1,bar,one,5
2,foo,two,5
3,bar,three,2
4,foo,two,5
5,bar,two,5


In [50]:
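# numeric_only=True skips the string column B, which recent pandas versions would otherwise reject
df.groupby('A').mean(numeric_only=True)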

A,C
bar,4.0
foo,3.666667


In [52]:
df.groupby('A')['C'].transform('mean')

0    3.666667
1    4.000000
2    3.666667
3    4.000000
4    3.666667
5    4.000000
Name: C, dtype: float64

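A groupby object on its own is lazy: displaying it shows only the object, and nothing is computed until an aggregation or a transformation is requested.

In [51]:
grouped = df.groupby('A')
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000241C25C9430>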
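
As a closing sketch, the difference between aggregating and transforming can be seen by comparing result lengths on the same toy data (column B is omitted here because it plays no role): an aggregation returns one value per group, while `transform` returns one value per original row.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
                   'C': [1, 5, 5, 2, 5, 5]})

per_group = df.groupby('A')['C'].mean()            # one value per group
per_row = df.groupby('A')['C'].transform('mean')   # one value per original row

print(len(per_group), len(per_row))

# transform is equivalent to looking up each row's group mean by its key.
looked_up = df['A'].map(per_group)
print((per_row - looked_up).abs().max())
```

Because the transformed result keeps the original index, it can be passed straight to `fillna` or assigned as a new column.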
