-----
<div class="alert alert-block alert-info">
    <h1>Introduction to EDA, <kbd>numpy</kbd> and <kbd>pandas</kbd></h1>
</div>

<div class="alert alert-block alert-warning">  
<h3> Introduction to EDA </h3>
    
EDA is a very broad term, but I take it to mean "learning useful information or formulating new questions about a population based upon an associated data set." 
    
This could be everything from interesting tidbits about how many people live in a particular area, to more important marketing information, such as what product characteristics people are willing to pay for.
    
</div>

<div class="alert alert-block alert-warning">  
<h3> Process of EDA </h3>
    
   The primary goals of EDA are to:
    
   <ul>
   <li><b>LOOK AT THE DATA!</b></li>
   <li> Examine the data information including column names, data types, shape, values, stats and distributions.</li>
   <li> Handle anomalies and missing values.</li>
   <li> Transform numeric features, such as normalizing or scaling.</li>
   <li> Process categorical features, such as encoding.</li>
   <li> Create a few new features for advanced graphics or models. </li>
   <li> Examine trends and relationships between features.</li> 
      
       
   </ul>
 
    
</div>

<div class="alert alert-block alert-warning">  
<h3> Process of EDA (continued) </h3>
    
   EDA is flexible by nature, and the key word is <b>exploratory</b>.
    
   <ul>
   <li> One of the best ways to learn EDA is to walk through an actual data set together, picking apart the data to ask questions and answer them. </li>
   <li> Sometimes we don't know what the questions are until we see the data. As you'll see, we will often ask the question: "What's going on with the data here?"</li>
   <li> EDA in Python usually combines the libraries <kbd>numpy</kbd>, <kbd>pandas</kbd> and <kbd>matplotlib</kbd>: the first two for processing the data set, and <kbd>matplotlib</kbd> for graphical checks.</li>

      
       
   </ul>
 
    
</div>

<div class="alert alert-block alert-warning">  
<h3> Overview of EDA/Viz implementation topics in MSDS593</h3>
    
   <ul>
   <li> We will learn the different components in EDA and how to manipulate, query, slice, and clean up data using <kbd>pandas</kbd> and <kbd>numpy</kbd>. </li>
   <li> We will learn high-level concepts of visualization, such as how to choose an appropriate design, and how to write the code to implement visualizations in <kbd>matplotlib</kbd>.</li>  
    <li> There are hundreds of functions with thousands of arguments that we can use. I do not know many of these and I forget a lot of them, and then I run to my mommy (genAI) for help. The key exercise for you is to keep <b>practicing, researching and exploring</b>. </li>
   </ul>

To start, let's have a quick walk-through of <kbd>numpy</kbd>, and then introduce <kbd>pandas</kbd>. 
</div>

<div class="alert alert-block alert-warning">  
<h3> Outline </h3>
    

* 1.0 Create an environment  
* 1.1 Introduction to <kbd>numpy</kbd> 
* 1.2 Introduction to <kbd>pandas</kbd>
    * 1.2-1 Basic operations on pandas series
        * a. Creating series
        * b. Series indices and indexing
        * c. Missing values
    * 1.2-2 Basic operations on pandas data frames
        * a. DataFrame indexing
        * b. Functions on columns
        * c. Missing values
        * d. Altering data frames
* 1.3 The first steps of EDA using <kbd>pandas</kbd>
</div>

### 1.0 Create an environment

You will learn about this in more detail in the programming class, but we should get used to creating virtual environments to isolate the work we are doing for each class. We can create a new virtual environment for this course by running the following commands:

`python3 -m venv eda`  
`source eda/bin/activate`  

You should notice, in the terminal, that you have switched over to the virtual environment. To deactivate, simply run  
`deactivate`

I would suggest creating all of your virtual environments somewhere central, with the same path so that it's easy to remember, and avoid adding them to a git repository. I put mine in the folder `~/tmp/env/`.

At this point we can install the libraries that we need in order to run the code below. For right now we need numpy, pandas, and ipykernel.

Let's try these versions of numpy, pandas and ipykernel:

`pip install numpy==2.0.2 pandas==2.3.0 ipykernel==6.29.5`

### 1.1 Introduction to <kbd>numpy</kbd>

Pandas uses NumPy in its implementation so it makes sense to take a look at numpy first. Numpy is how we will do most of our numerical computing. It provides flexible implementations of vectors and matrices (`ndarray`) and fast operations and functions for linear algebra, random number generation, etc.

I always start with the following preamble in my notebooks or Python files:

In [1]:
import numpy as np
import pandas as pd

That allows us to refer to numpy package elements with the shorthand `np` and to pandas package elements with the shorthand `pd`.

To give you a taste, here's how we would create two vectors, subtract one from the other, and display the result:

In [2]:
a = np.array([1,2,3]) # create vector a=[1,2,3]
b = np.array([4,5,6]) # create vector b=[4,5,6]
print(" type:", type(a)) # check the type of a in python
print("shape:", a.shape) # check the shape of a in python
a - b   # basic math

 type: <class 'numpy.ndarray'>
shape: (3,)


array([-3, -3, -3])

Here's a 2 x 3 matrix with fixed elements:

In [3]:
C = np.array([[1,2,3],[4,5,6]]) # this is like a "list of lists" in Python
print(type(C))
print(C.shape)
print(C)

<class 'numpy.ndarray'>
(2, 3)
[[1 2 3]
 [4 5 6]]


The `@` operator does a matrix multiply $Ca$, and the `dot` method gives the same result:

In [4]:
C @ a

array([14, 32])

In [5]:
C.dot(a)

array([14, 32])

<div class="alert alert-block alert-success">
Notice that we didn't need any Python looping to implement matrix multiplication ourselves. All of this is built into numpy. This just scratches the surface but gives you a taste of what's possible with numpy, and we will see more as we go through the course. 
    </div>
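
To make that point concrete, here is a small sketch that compares a hand-written loop with the built-in `@` operator. The helper name `matvec_loop` is made up for this illustration and is not part of numpy.

```python
import numpy as np

C = np.array([[1, 2, 3], [4, 5, 6]])
a = np.array([1, 2, 3])

def matvec_loop(M, v):
    """Hypothetical helper: matrix-vector product with explicit Python loops."""
    result = []
    for row in M:
        total = 0
        for m_ij, v_j in zip(row, v):
            total += m_ij * v_j
        result.append(total)
    return np.array(result)

looped = matvec_loop(C, a)   # what we would have to write ourselves
vectorized = C @ a           # what numpy gives us for free
print(looped, vectorized)
```

Both approaches compute the same values; the vectorized form is shorter and pushes the looping down into numpy's compiled code.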

### 1.2 Introduction to <kbd>pandas</kbd>

Pandas is kind of like a table in a database or a spreadsheet that we can manipulate programmatically in Python. Pandas has two primary entities that you must be careful to distinguish between to avoid getting confused:

<ul>
<li><kbd>DataFrame</kbd> is a 2D tabular data structure; it has both rows and columns</li>
<li><kbd>Series</kbd> is a 1D array (column) data structure</li>
    </ul>
    
While a series looks like a column from a data frame, it is a separate object with different sets of functions that you can apply to it. It is so important to distinguish between them that you may want to consider prefixing the name of all data frame objects with *df_*, such as *df_salesinfo*.

<div class="alert alert-block alert-success">
Pandas data frames will be your primary data structure until machine learning, where you will learn how to build more complicated data structures such as decision trees. In time series we will see more pandas series objects. 
</div>

<div class="alert alert-block alert-success">
<b>Introduction to <kbd>pandas</kbd> outline:</b>
    
<ol>

<li>Series: indexing, operation and missing values</li>
<li>Data frame: selecting, slicing, method chaining, indexes and missing values</li>
<li>The first steps of EDA using <kbd>pandas</kbd></li>


</ol>
</div>

### 1.2-1. Basic operations on pandas series

A series object is a one-dimensional sequence of values, all with the same data type. For example, I could have a series of integers, strings, or floating-point numbers. We can think about series as columns of dataframes to get familiar with the operations. Later, we will see more of its own value in time series. 

### a. Creating pandas series

#### Element data types

You can learn more about the [NumPy data types](https://numpy.org/doc/stable/user/basics.types.html) that pandas builds on, but the key idea is that we must distinguish between integers, floating-point values (reals), booleans, strings, and datetimes.


|Description| Python | Pandas|
| ----------- | --------- | ---- |
|Integers| `int`|`int64`|
|Text| `str`| `object`|
|Double/Float| `float`|`float64` |
|Boolean (T/F)| `bool`| `bool`|
|Date/Time| `datetime` | `datetime64`|
|[Categorical](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html)| n/a | `category` (held internally as an integer)|


All entities in Python have both a value and associated type; everything is just bits in the computer and so Python needs to know how to interpret the bits as a string or number or date. Different types behave differently and respond to a different set of functions.

In [6]:
# make a series with four numbers of type float

a = pd.Series([10,2.4,81,1005])  
print(type(a))
a

<class 'pandas.core.series.Series'>


0      10.0
1       2.4
2      81.0
3    1005.0
dtype: float64

In [7]:
len(a), a.dtype

(4, dtype('float64'))

In [8]:
# make a series with strings of type object

b = pd.Series(['Xue','Mary','Ollie']) 
print("Type is", b.dtype)
b

Type is object


0      Xue
1     Mary
2    Ollie
dtype: object

There are lots of functions you can perform on series. For example, here are some operations for numerical columns:

In [9]:
a.min(), a.max(), a.mean(), a.sum(), a.count()

(np.float64(2.4),
 np.float64(1005.0),
 np.float64(274.6),
 np.float64(1098.4),
 np.int64(4))

The string-related functions are available using the `str` member of the series:

In [10]:
b.str.lower() # convert uppercase letters to lowercase

0      xue
1     mary
2    ollie
dtype: object

In [11]:
b.str.extract("([A-Z])") # extract the first uppercase letter using a regular expression

Unnamed: 0,0
0,X
1,M
2,O


<div class="alert alert-block alert-danger">
    
<b> Practice 1</b>

<ol>
    
<li>Run all the code above and make sure everything works</li>
<li>Create a new notebook called "Notebook 1 Practice", import <kbd>numpy</kbd> and <kbd>pandas</kbd></li>
<li>Create a series `name` to record the names (first and last) of three classmates around you.</li>
<li>Convert all names to upper case.</li>
<li>Create a series `age` to record the age of the same three classmates around you.</li> 
<li>Calculate the average(mean) age of the series `age`.</li>     
</ol>            


</div>

### b. Series indexes and indexing

Each series object has an associated index that sort of names the elements of the series, just as the index of the data frame identifies the rows of the data (see later).  The index values are commonly integers or strings. If we don't explicitly specify an index, the index is just a series of consecutive integers starting from zero. The series then behaves just like a list in Python when we use the brackets `[...]` index operator.

We all understand the idea of a list of numbers in Python, and we access the elements of a series using the same syntax. The only difference is that the values inside the `[...]` can only be integers (positions) for Python lists, whereas in pandas the value must match the type of the series index, such as a string label for a string index.

This shows the standard index is simply the integer positions:

In [12]:
a

0      10.0
1       2.4
2      81.0
3    1005.0
dtype: float64

In [13]:
a.index

RangeIndex(start=0, stop=4, step=1)

In [14]:
a[2]

np.float64(81.0)

But, we can create a series with, for example, an index whose members are strings:

In [15]:
b = pd.Series([10,2.4,81,1005], index=['t','u','v','w'])
b

t      10.0
u       2.4
v      81.0
w    1005.0
dtype: float64

In [16]:
b.index

Index(['t', 'u', 'v', 'w'], dtype='object')

In [17]:
a[2], a[3]

(np.float64(81.0), np.float64(1005.0))

In [18]:
b['u'], b['w']

(np.float64(2.4), np.float64(1005.0))

In [19]:
b[2],b[3]


(np.float64(81.0), np.float64(1005.0))

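While this direct indexing looks convenient, it's better to use the more verbose but explicit `iloc` or `loc`. Using integer positions inside `[...]` on a series with a non-integer index, as in `b[2]` above, is deprecated in recent pandas versions and raises a `FutureWarning`. The advantage of `loc` and `iloc` will show even more clearly in the data frame case.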

<div class="alert alert-block alert-success">
    
The main distinction between `loc` and `iloc` is: `loc` is label-based, which means that you have to specify rows and columns based on their row and column labels. `iloc` is integer position-based, so you have to specify rows and columns by their 0-based integer positions.
    
</div>
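
The difference is easiest to see when the integer labels do not coincide with the positions. The following is a minimal sketch with a made-up series `s` whose labels are deliberately shuffled:

```python
import pandas as pd

# Integer labels that deliberately differ from the positions
s = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s.loc[0]      # the element whose label is 0
by_position = s.iloc[0]  # the element in the first position
print(by_label, by_position)
```

Here `loc[0]` and `iloc[0]` return different elements, which is exactly why being explicit about labels versus positions matters.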

In [20]:
a.iloc[2], a.loc[2]

(np.float64(81.0), np.float64(81.0))

In [21]:
b.loc['u'], b.iloc[1]

(np.float64(2.4), np.float64(2.4))

We can use the same notation to set values:

In [22]:
a

0      10.0
1       2.4
2      81.0
3    1005.0
dtype: float64

In [23]:
a.iloc[2] = 9999
a

0      10.0
1       2.4
2    9999.0
3    1005.0
dtype: float64

In [24]:
b.loc['u'] = 9999
b

t      10.0
u    9999.0
v      81.0
w    1005.0
dtype: float64

### Selecting multiple values with indexing

We can even use **multiple values** when indexing:

In [25]:
b.loc[['u','w']] # a list of index labels ['u','w']

u    9999.0
w    1005.0
dtype: float64

<div class="alert alert-block alert-danger">
    
How do we select multiple values with `iloc`?

</div>

<div class="alert alert-block alert-danger">
    
<b> Practice 2</b>

<ol>
    
<li>Define a new series where the values are the ages and the index is the names</li>
<li> Try to use `[...]`, `loc` and `iloc` to select value(s) from your series </li> 
</ol>            


</div>

### Pandas matches up series index values

The index attached to the series elements makes column arithmetic (like `a+b`) a little more complex but much more sophisticated: pandas aligns the elements by index label before operating on them.

For example, imagine that we have values associated with multiple states in the US, but where the values are not lined up according to state:

In [26]:
income = pd.Series([100,110,200,45], index=['WA','NV','CA','IA'])
taxes  = pd.Series([20,25,5], index=['IA','NV','WA']) # different order
print(income)
print(taxes)

WA    100
NV    110
CA    200
IA     45
dtype: int64
IA    20
NV    25
WA     5
dtype: int64


In [27]:
income - taxes

CA     NaN
IA    25.0
NV    85.0
WA    95.0
dtype: float64

<div class="alert alert-block alert-success">
The key thing to notice here is that we don't have tax information for CA, so the result for CA is `NaN`. When we subtract one series from the other, pandas lines them up according to the index values so that things match up: WA to WA etc.
</div>
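
If a `NaN` for unmatched labels is not what you want, the `sub` method accepts a `fill_value`. The sketch below assumes, purely for illustration, that a missing tax entry means zero tax; whether that is reasonable depends entirely on the data.

```python
import pandas as pd

income = pd.Series([100, 110, 200, 45], index=['WA', 'NV', 'CA', 'IA'])
taxes = pd.Series([20, 25, 5], index=['IA', 'NV', 'WA'])

# Assumption for illustration only: a missing tax entry is treated as 0
net = income.sub(taxes, fill_value=0)
print(net)
```

The labels are still aligned first; `fill_value` only substitutes for the side that has no entry for a given label.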

### Relational operators for series

To give you a taste of the power of pandas,  we can even use relational operators to select elements from a series. Here's how to get all values less than 100 from `a`:

In [28]:
a.loc[a<100]

0    10.0
1     2.4
dtype: float64

In [29]:
a<100

0     True
1     True
2    False
3    False
dtype: bool
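
Boolean series like the one above can be combined with `&` (and), `|` (or) and `~` (not). Here is a small sketch using the same values that `a` holds at this point; the thresholds are arbitrary.

```python
import pandas as pd

a = pd.Series([10, 2.4, 9999, 1005])

# Each comparison needs its own parentheses because & and | bind more tightly
between = a.loc[(a > 5) & (a < 2000)]
extremes = a.loc[(a < 5) | (a > 2000)]
print(between)
print(extremes)
```

Note that Python's `and` and `or` keywords do not work element-wise on series, so the symbolic operators are required.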

###  Series look like dictionaries

You can think of a series as a dictionary where the index values map to values in the series. But, unlike dictionaries, Series index values do not have to be unique:

In [30]:
x = pd.Series([1,2,3], index=['u','u','t'])
x

u    1
u    2
t    3
dtype: int64

In [31]:
y = pd.Series([3,1,2], index=['u','d','t'])

In [32]:
y+x

d    NaN
t    5.0
u    4.0
u    5.0
dtype: float64

In [33]:
x.loc['u']

u    1
u    2
dtype: int64

### Arithmetic with series

As with numpy vectors, we can add series together and perform lots of other operations. Let's make some simple series of numbers and add them together:

In [34]:
a = pd.Series(range(5,10))
b = pd.Series(range(30,35))
print(a)
print(b)

0    5
1    6
2    7
3    8
4    9
dtype: int64
0    30
1    31
2    32
3    33
4    34
dtype: int64


In [35]:
a+b

0    35
1    37
2    39
3    41
4    43
dtype: int64

In [36]:
a.values # get numpy array underlying the series

array([5, 6, 7, 8, 9])

In [37]:
a.values + b.values

array([35, 37, 39, 41, 43])

In [38]:
(a+b).values

array([35, 37, 39, 41, 43])

<div class="alert alert-block alert-success">
Keep in mind: any arithmetic involving a NaN results in a NaN.
</div>

### c. Missing values in series

Real-world data sets often have missing values. People who record data have different ways to represent missing data. A common way to do this is to choose a sentinel value like -1 or a string like 'n/a' or 'missing', or to leave the entry blank. I've also seen datasets where 9999 was used as the missing value indicator for year values. 

Pandas formalizes missing values by representing them as "not a number", a special floating-point value, `np.nan`, from numpy that indicates the value is invalid and should not be treated as valid.

In [39]:
c = pd.Series([10,20,np.nan,40])
c

0    10.0
1    20.0
2     NaN
3    40.0
dtype: float64

When doing EDA, it's very common to ask if there are missing values, which we can do with `isnull()`:

In [40]:
c.isnull() # return a Boolean series that is True where the value is missing (NaN)

0    False
1    False
2     True
3    False
dtype: bool

<div class="alert alert-block alert-success">
    
<b>A subtle but important point here</b>: `np.nan` has type `float` and, since all values in a series must be the same type, a series of integer values with at least one missing value causes the entire series to be floating-point. But, non-numeric series stay as `object`.
    
</div>
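
A quick way to see this type change is to compare the `dtype` of a few small series. The sketch below also shows pandas' nullable integer type `"Int64"` (capital I), which is an alternative not used elsewhere in this notebook; it keeps integers as integers even when a value is missing.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])                        # plain integers
with_gap = pd.Series([1, 2, np.nan])               # one missing value forces float
nullable = pd.Series([1, 2, None], dtype="Int64")  # nullable integer type

print(ints.dtype, with_gap.dtype, nullable.dtype)
```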

In [41]:
d = pd.Series(['hi','mom',np.nan])
d

0     hi
1    mom
2    NaN
dtype: object

In [42]:
d.isnull()

0    False
1    False
2     True
dtype: bool

To get all of the non-missing data:

In [43]:
c.loc[~c.isnull()]  # ~c.isnull() means "not null" or "not np.nan"

0    10.0
1    20.0
3    40.0
dtype: float64

In [44]:
d.loc[~d.isnull()]

0     hi
1    mom
dtype: object
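
There are a few equivalent spellings for this selection. As a small sketch, `notnull()` is the complement of `isnull()`, and `dropna()` drops the missing entries directly:

```python
import numpy as np
import pandas as pd

c = pd.Series([10, 20, np.nan, 40])

masked = c.loc[~c.isnull()]       # negate the "is missing" mask
via_notnull = c.loc[c.notnull()]  # use the "is present" mask directly
via_dropna = c.dropna()           # let pandas drop the missing entries
print(via_dropna)
```

All three keep the original index labels, so the label `2` is simply absent from the result.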

If you ever want a version of a series with the NaNs replaced with a fill value, there is a handy function for you:

In [45]:
c.fillna(1000)

0      10.0
1      20.0
2    1000.0
3      40.0
dtype: float64

###  Functions do different things with missing values

You must be very careful that you know how various functions behave with respect to `np.nan`. Here are some examples:

In [46]:
c = pd.Series([10,20,np.nan,40])
c

0    10.0
1    20.0
2     NaN
3    40.0
dtype: float64

In [47]:
len(c), c.count(), c.sum(), c.sum(skipna=False), np.sum(c), sum(c)

(4, np.int64(3), np.float64(70.0), np.float64(nan), np.float64(70.0), nan)

In [48]:
c * 3

0     30.0
1     60.0
2      NaN
3    120.0
dtype: float64
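
The different sums above come down to who does the summing. The following sketch spells it out: numpy hands a pandas series back to the pandas `sum` method (which skips NaN by default), while a plain numpy array propagates the NaN unless we use `np.nansum`.

```python
import numpy as np
import pandas as pd

c = pd.Series([10, 20, np.nan, 40])

series_sum = np.sum(c)            # handled by the pandas method, which skips NaN
array_sum = np.sum(c.values)      # plain numpy array: the NaN propagates
nan_aware = np.nansum(c.values)   # numpy's NaN-ignoring variant
print(series_sum, array_sum, nan_aware)
```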

<div class="alert alert-block alert-danger">
    
<b> Practice 3</b>

<ol>
    
<li>Add a new missing value to your series from Practice 2 with the index "Robert"</li>
<li> Now try to find the average(mean) age of the series ignoring the missing value. There are different approaches to do this.  </li> 
</ol>            


</div>

### 1.2-2 Basic operations on pandas DataFrames

A `DataFrame` object is a two-dimensional matrix with rows and columns just like a spreadsheet or database table. Each row corresponds to an <b>instance</b> or <b>observation</b>, and columns are <b>features</b> of this instance.  

Each column can have different data types, but all values within a column must be of the same data type; the columns behave like series objects.

DataFrame columns are ordered and the name-to-column mapping is stored in an index.  DataFrames also have an index for the rows, just like a series has an index into the values of the series. So, a data frame has two indexes which lets us zoom in, for example, on a specific element using row and column index values.



Let's load the `cars.csv` data. The code below reads it from the `../data/` directory relative to this notebook. You can download this file onto your own computer wherever you want, but make sure that you specify the appropriate file path when loading it with pandas.

In [49]:
df_cars = pd.read_csv("../data/cars.csv")

We can check the first few rows using `df.head()`:

In [50]:
df_cars.head(3)

Unnamed: 0,MPG,CYL,ENG,WGT
0,18.0,8,307.0,3504
1,15.0,8,350.0,3693
2,18.0,8,318.0,3436


### a. DataFrame indexing

Recall that when you index into a series, you get an element.

With a data frame, the `[...]` operator indexes by *column name* by default. The default index for the rows is just the integer position.

#### Column indexing

Let's use `[...]` index operator, which gets a series object:

In [51]:
df_cars['MPG'] 

0      18.0
1      15.0
2      18.0
3      16.0
4      17.0
       ... 
387    27.0
388    44.0
389    32.0
390    28.0
391    31.0
Name: MPG, Length: 392, dtype: float64

In [52]:
type(df_cars['MPG'])

pandas.core.series.Series

In [53]:
mpg = df_cars['MPG']
mpg

0      18.0
1      15.0
2      18.0
3      16.0
4      17.0
       ... 
387    27.0
388    44.0
389    32.0
390    28.0
391    31.0
Name: MPG, Length: 392, dtype: float64

Once we have a series object, we can use the series index to get elements (repetition to improve your retention):

In [54]:
mpg.iloc[3], mpg.loc[3] 

(np.float64(16.0), np.float64(16.0))

<div class="alert alert-block alert-danger">
    
<b> Practice 4</b>
Let's practice on the Kaggle competition dataset: [Kaggle's Uber Pickups in New York City competition](https://www.kaggle.com/fivethirtyeight/uber-pickups-in-new-york-city). A small subset of the data is provided as `kaggle-uber-other-federal.csv` on Canvas.

<ol>
    
<li>Download and read it into a pandas DataFrame</li>
<li>Extract the column `Status` and define it as a new series</li>
<li>Extract the element at index 10 from that series</li>
    
</ol>            


</div>

#### Row indexing


Let's look at the Uber data together now and extract three columns. Please note the use of a list as an index value and the fact that we get a data frame back, not a series, because we asked for more than one column:

In [55]:
df_uber = pd.read_csv('../data/kaggle-uber-other-federal.csv')
df_status= df_uber[['Date','Time','Status']]
print(type(df_status))
df_status.head(3)

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Date,Time,Status
0,07/01/2014,07:15 AM,Cancelled
1,07/01/2014,07:30 AM,Arrived
2,07/01/2014,08:00 AM,Assigned


In [56]:
df_status.index

RangeIndex(start=0, stop=99, step=1)

In [57]:
df_status.iloc[0] # first row

Date      07/01/2014
Time        07:15 AM
Status     Cancelled
Name: 0, dtype: object

In [58]:
df_status.iloc[2:4] # a slice of rows

Unnamed: 0,Date,Time,Status
2,07/01/2014,08:00 AM,Assigned
3,07/01/2014,09:00 AM,Assigned


In [59]:
df_status.loc[2:4] # did you notice any difference?

Unnamed: 0,Date,Time,Status
2,07/01/2014,08:00 AM,Assigned
3,07/01/2014,09:00 AM,Assigned
4,07/01/2014,09:30 AM,Assigned


```
df_status.iloc[:]  # get all rows
df_status.iloc[0:len(df_status)] # SAME
```

#### Indexing both rows and columns

In [60]:
df_status.iloc[:,0] # get all rows for column at position 0 ('Date') as a Series
#df_status.iloc[0:len(df_status),0] # SAME

0     07/01/2014
1     07/01/2014
2     07/01/2014
3     07/01/2014
4     07/01/2014
         ...    
94    07/21/2014
95    07/21/2014
96    07/21/2014
97    07/21/2014
98    07/22/2014
Name: Date, Length: 99, dtype: object

<div class="alert alert-block alert-success">
    
When there are many columns, it is often easier to index by the actual column names. The `loc` operator is the better tool for that:
    
</div>

In [61]:
df_status.loc[1,['Time','Status']]

Time      07:30 AM
Status     Arrived
Name: 1, dtype: object

<div class="alert alert-block alert-success">
    
<b> So many ways to do indexing!
When are `[...]`, `iloc` and `loc` the same or different?</b>
    
In the following cases, they are the same:

<ol>
    
<li>Selecting a single column "A":  df['A'] is the same as df.loc[:, 'A']. </li>
<li>    Selecting a list of columns df[['A', 'B', 'C']] is the same as df.loc[:, ['A', 'B', 'C']] </li>
<li>Slicing by rows df[1:3] is the same as df.iloc[1:3] -> selects rows 1 and 2. Note, however, if you slice rows with loc, instead of iloc, you'll get rows 1, 2 and 3. </li>
</ol>

In the following cases they differ; specifically, `[...]` doesn't work:

<ol>
    
<li>You can select a single row with df.loc[row_label]. </li>
<li>    You can select a list of rows with df.loc[[row_label1, row_label2]]</li>
<li>You can slice columns with df.loc[:, 'A':'C']</li>
</ol>
    
</div>
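
The equivalences above can be checked directly. This is a minimal sketch on a made-up toy frame `df` with columns `A`, `B`, `C` (not one of the course data sets):

```python
import pandas as pd

# Hypothetical toy frame, just for comparing the indexing styles
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]})

same_column = df['A'].equals(df.loc[:, 'A'])
same_columns = df[['A', 'B']].equals(df.loc[:, ['A', 'B']])
same_rows = df[1:3].equals(df.iloc[1:3])

n_iloc = len(df.iloc[1:3])  # positions 1 and 2: end is excluded
n_loc = len(df.loc[1:3])    # labels 1, 2 and 3: end is included
print(same_column, same_columns, same_rows, n_iloc, n_loc)
```

The last two lines show the one trap worth memorizing: slices with `iloc` exclude the end position, while slices with `loc` include the end label.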

Another very useful operation is to set the index to one of the columns of the data frame. We see this a lot in time series:

In [62]:
df_status = df_status.set_index('Date')   
df_status.head(5)

Date,Time,Status
07/01/2014,07:15 AM,Cancelled
07/01/2014,07:30 AM,Arrived
07/01/2014,08:00 AM,Assigned
07/01/2014,09:00 AM,Assigned
07/01/2014,09:30 AM,Assigned


Then we can conveniently ask for all rows with a specific date using the index:

In [63]:
df_status.loc['07/03/2014'].head(3)

Date,Time,Status
07/03/2014,05:00 AM,Arrived
07/03/2014,05:45 AM,Assigned
07/03/2014,06:55 AM,Arrived


We can reset the index as well to make the index become a column again:

In [64]:
df_status = df_status.reset_index()
df_status.head(3)

Unnamed: 0,Date,Time,Status
0,07/01/2014,07:15 AM,Cancelled
1,07/01/2014,07:30 AM,Arrived
2,07/01/2014,08:00 AM,Assigned


### b. Functions on columns

As we did in the section on series, we can apply functions to columns in a data frame. A very common task is to look at the unique values:

In [65]:
df_status['Status'].unique()

array(['Cancelled', 'Arrived', 'Assigned'], dtype=object)

We can also ask for the count for each unique value:

In [66]:
df_status['Status'].value_counts()

Status
Arrived      58
Assigned     32
Cancelled     9
Name: count, dtype: int64
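
When proportions are more informative than raw counts, `value_counts` takes a `normalize=True` argument. A small sketch with made-up status values standing in for a categorical column:

```python
import pandas as pd

# Hypothetical values standing in for a categorical column
status = pd.Series(['Arrived', 'Assigned', 'Arrived', 'Cancelled', 'Arrived'])

counts = status.value_counts()                # number of rows per category
shares = status.value_counts(normalize=True)  # fraction of rows per category
print(counts)
print(shares)
```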

Because the columns are retrieved as series objects, we can do the same column arithmetic:

In [67]:
df_cars['MPG']/df_cars['WGT']

0      0.005137
1      0.004062
2      0.005239
3      0.004661
4      0.004929
         ...   
387    0.009677
388    0.020657
389    0.013943
390    0.010667
391    0.011397
Length: 392, dtype: float64

### c. Missing values in a data frame

Returning to our friend the missing value, we can ask for a matrix of Boolean values indicating whether a specific row and column value is missing:

In [68]:
df_status.isnull().head(5) 

Unnamed: 0,Date,Time,Status
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False


In [69]:
# how many in each column are missing? it turns out none are missing:
df_status.isnull().sum()  

Date      0
Time      0
Status    0
dtype: int64

In [70]:
# how many in each row are missing?
df_status.isnull().sum(axis=1)

0     0
1     0
2     0
3     0
4     0
     ..
94    0
95    0
96    0
97    0
98    0
Length: 99, dtype: int64

In [71]:
# don't care about the count, just want to know if there is any absence of values
df_status.isnull().any() 

Date      False
Time      False
Status    False
dtype: bool
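
Since the Uber columns shown here have nothing missing, the following sketch uses a made-up toy frame with a few gaps to show what these calls return when values really are absent. The column names `x` and `y` are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few gaps
df = pd.DataFrame({'x': [1.0, np.nan, 3.0, 4.0],
                   'y': ['a', 'b', None, 'd']})

missing_per_column = df.isnull().sum()   # count of missing values per column
fraction_missing = df.isnull().mean()    # mean of True/False = share missing
complete_rows = df.dropna()              # keep only rows with no missing values
print(missing_per_column)
print(fraction_missing)
print(complete_rows)
```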

<div class="alert alert-block alert-danger">
    
<b> Practice 5</b>
           
<ol>
    
<li>Download and read the dataset `titanic.csv`. </li>
<li>Check the unique values of "Survived" and the count of each value. </li> 
<li>Check if there is any missingness in each column. </li>
<li>Check the number of missing values in "Embarked". </li>
</ol>
</div>

### d. Altering data frames

We often want to alter the values in a data frame, such as when we clean up data in preparation for modeling. We can set individual values, rows, or entire columns.

In [72]:
df_status['junk'] = 0      # insert a new column and set to 0
df_status.head(3)

Unnamed: 0,Date,Time,Status,junk
0,07/01/2014,07:15 AM,Cancelled,0
1,07/01/2014,07:30 AM,Arrived,0
2,07/01/2014,08:00 AM,Assigned,0


In [73]:
df_status['junk'] = 99     # overwrite a column with a single value
df_status.head(3)

Unnamed: 0,Date,Time,Status,junk
0,07/01/2014,07:15 AM,Cancelled,99
1,07/01/2014,07:30 AM,Arrived,99
2,07/01/2014,08:00 AM,Assigned,99


In [74]:
df_status = df_status.drop('junk', axis=1)    # axis=1 means drop a column
df_status.head(3)

Unnamed: 0,Date,Time,Status
0,07/01/2014,07:15 AM,Cancelled
1,07/01/2014,07:30 AM,Arrived
2,07/01/2014,08:00 AM,Assigned


Most of the time we are not deleting rows, although we can do so using `drop`.  More likely we are asking for a subset of the data.  For example, here is how to find all trips that arrived:

In [75]:
df_status[df_status['Status']=='Arrived'].head(3)

Unnamed: 0,Date,Time,Status
1,07/01/2014,07:30 AM,Arrived
5,07/01/2014,12:00 PM,Arrived
8,07/01/2014,02:30 PM,Arrived


In [76]:
print(df_status.head(3))
df_status.loc[0,'Status'] = 'Arrived'     # Update a single categorical value in one column
df_status.head(3)

         Date      Time     Status
0  07/01/2014  07:15 AM  Cancelled
1  07/01/2014  07:30 AM    Arrived
2  07/01/2014  08:00 AM   Assigned


Unnamed: 0,Date,Time,Status
0,07/01/2014,07:15 AM,Arrived
1,07/01/2014,07:30 AM,Arrived
2,07/01/2014,08:00 AM,Assigned


<div class="alert alert-block alert-success">
    
<b>WARNING</b>: Injecting a new column into, say, `df_status` is no problem as long as the data frame referred to is the entire data frame, and not a subset.  
    
For example, the following triggers a `SettingWithCopyWarning` (a warning, not an error) because I'm updating a "slice" of the original data frame.

</div>

In [77]:
df_x = df_status.iloc[5:10]
df_x['foo'] = 0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_x['foo'] = 0


In [78]:
# solution: take an explicit copy so that df_x is an independent data frame
df_x = df_status.iloc[5:10].copy()
df_x['foo'] = 0
df_x.head(2)

Unnamed: 0,Date,Time,Status,foo
5,07/01/2014,12:00 PM,Arrived,0
6,07/01/2014,12:30 PM,Assigned,0
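
To confirm that the copy really is independent, here is a minimal sketch on a made-up toy frame; the column `foo` ends up only in the copy.

```python
import pandas as pd

# Hypothetical toy frame
df = pd.DataFrame({'Status': ['Arrived', 'Assigned', 'Cancelled', 'Arrived']})

subset = df.iloc[1:3].copy()   # explicit copy: an independent data frame
subset['foo'] = 0              # adds the column to the copy only

print(subset.columns.tolist())
print(df.columns.tolist())
```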


### 1.3. The first steps of EDA using <kbd>pandas</kbd>
    
While we go through this material, keep in mind that there is a basic process you can follow when confronted with a new data set:

<ol>
<li>How many rows (records, observations, ...) are there?</li>
<li>What are the column names and column datatypes?</li>
<li>Look at the data to see if the assigned/current column datatypes make sense</li>
<li>Identify categorical columns, nominal and ordinal</li>
<li>Compute basic summary statistics for numerical columns, such as min/max/mean/median/quantiles</li>
<li>Compute basic summary tables for categorical columns, such as frequency and percentage of each category</li>
<li><i>Visualize</i> some or all of the numerical columns to examine their distributions</li>
<li>Which columns have missing values? Watch out for sentinel values that are physically present, but indicate missing values (e.g., 0, -1, -999, 'none or unspecified', '')</li>
</ol>
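
Most of the checklist above maps onto one or two pandas calls. The following is a rough sketch on a made-up toy frame (the column names `mpg`, `cyl` and `origin` are placeholders); substitute your own data frame and columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; substitute your own data frame
df = pd.DataFrame({
    'mpg': [18.0, 15.0, np.nan, 16.0],
    'cyl': [8, 8, 4, 6],
    'origin': ['US', 'US', 'JP', None],
})

n_rows, n_cols = df.shape                     # how many rows and columns?
column_types = df.dtypes                      # column names and data types
numeric_summary = df.describe()               # count/mean/std/min/quartiles/max
origin_counts = df['origin'].value_counts()   # frequency of each category
missing_counts = df.isnull().sum()            # which columns have missing values?

print(n_rows, n_cols)
print(column_types)
print(numeric_summary)
print(origin_counts)
print(missing_counts)
```

By default `describe()` only summarizes the numeric columns, which is why the categorical column gets its own `value_counts()` call. Sentinel values are not detected by `isnull()`, so those still have to be found by looking at the data.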


### Let's start 

To begin, let's:


- Check data types



In [79]:
df_uber.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99 entries, 0 to 98
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Date             99 non-null     object
 1   Time             99 non-null     object
 2   PU_Address       99 non-null     object
 3   DO_Address       98 non-null     object
 4   Routing Details  99 non-null     object
 5   PU_Address.1     99 non-null     object
 6   Status           99 non-null     object
dtypes: object(7)
memory usage: 5.5+ KB


<div class="alert alert-block alert-success">
    
Does anything look wrong?

</div>

In [80]:
#change types when reading the data
df_uber = pd.read_csv("../data/kaggle-uber-other-federal.csv",
                      parse_dates=['Date','Time'],
                      dtype={'Status':'category'})
df_uber.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99 entries, 0 to 98
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Date             99 non-null     datetime64[ns]
 1   Time             99 non-null     datetime64[ns]
 2   PU_Address       99 non-null     object        
 3   DO_Address       98 non-null     object        
 4   Routing Details  99 non-null     object        
 5   PU_Address.1     99 non-null     object        
 6   Status           99 non-null     category      
dtypes: category(1), datetime64[ns](2), object(4)
memory usage: 5.0+ KB




<div class="alert alert-block alert-success">
    
`dtype` accepts NumPy data types such as strings, integers and floats, as well as pandas types such as `category`. Datetime columns are the exception: they have to be requested through `parse_dates`.
</div>
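
Here is a minimal sketch of the two options side by side. It assumes a tiny in-memory CSV (`csv_text` and the `Fare` column are made up for illustration) so that it runs without any file on disk.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk
csv_text = """Date,Status,Fare
2014-07-01,Cancelled,12.5
2014-07-02,Arrived,30.0
"""

df_demo = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Date"],                              # datetime columns go here
    dtype={"Status": "category", "Fare": "float64"},   # everything else goes here
)

print(df_demo.dtypes)
```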

<div class="alert alert-block alert-success">

With more complex data, we can also change the data types *after* reading the CSV. For example, `pd.to_datetime()` converts a column to the datetime type. We will practice that when we do more feature engineering in the EDA of real data sets.
</div>
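
As a quick sketch of converting after the read, assume a small made-up DataFrame whose columns all start out as plain strings (the values below are illustrative only):

```python
import pandas as pd

# Hypothetical data as it might look right after a plain read_csv
df_demo = pd.DataFrame({
    "Date": ["2014-07-01", "2014-07-02"],
    "Status": ["Cancelled", "Arrived"],
})

# Convert after reading: string -> datetime, string -> category
df_demo["Date"] = pd.to_datetime(df_demo["Date"], format="%Y-%m-%d")
df_demo["Status"] = df_demo["Status"].astype("category")

print(df_demo.dtypes)
```

Passing an explicit `format` is optional, but it makes the intended date layout visible to the reader.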

In [81]:
#check your data every time you make changes
df_uber.head(2)

Unnamed: 0,Date,Time,PU_Address,DO_Address,Routing Details,PU_Address.1,Status
0,2014-07-01,2025-07-12 07:15:00,"Brooklyn Museum, 200 Eastern Pkwy., BK NY;","1 Brookdale Plaza, BK NY;","PU: Brooklyn Museum, 200 Eastern Pkwy., BK NY;...","Brooklyn Museum, 200 Eastern Pkwy., BK NY; DO:...",Cancelled
1,2014-07-01,2025-07-12 07:30:00,"33 Robert Dr., Short Hills NJ;","John F Kennedy International Airport, vitona A...","PU: 33 Robert Dr., Short Hills NJ; DO: John F ...","33 Robert Dr., Short Hills NJ; DO: John F Kenn...",Arrived


Oops. The `Time` column got today's date added on. Convert it to time-only values:

In [82]:
#a taste of handling datetime columns
df_uber['Time'] = df_uber['Time'].dt.time  # .dt.time keeps only the time of day
df_uber.head(2)

Unnamed: 0,Date,Time,PU_Address,DO_Address,Routing Details,PU_Address.1,Status
0,2014-07-01,07:15:00,"Brooklyn Museum, 200 Eastern Pkwy., BK NY;","1 Brookdale Plaza, BK NY;","PU: Brooklyn Museum, 200 Eastern Pkwy., BK NY;...","Brooklyn Museum, 200 Eastern Pkwy., BK NY; DO:...",Cancelled
1,2014-07-01,07:30:00,"33 Robert Dr., Short Hills NJ;","John F Kennedy International Airport, vitona A...","PU: 33 Robert Dr., Short Hills NJ; DO: John F ...","33 Robert Dr., Short Hills NJ; DO: John F Kenn...",Arrived
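
Why does a date show up at all? A time-only string carries no date, so the parser has to fill one in. The sketch below assumes the times are stored as `HH:MM` strings (the real layout in the Uber file may differ). With an explicit `format`, the filler date is 1900-01-01 instead of today's date, and `.dt.time` drops it either way.

```python
import datetime
import pandas as pd

# Hypothetical time-only strings
times = pd.Series(["07:15", "07:30"])

# Parsing adds a placeholder date because a Timestamp always has one
parsed = pd.to_datetime(times, format="%H:%M")
print(parsed)

# Keep only the time of day; the result holds datetime.time objects
only_time = parsed.dt.time
print(only_time)
```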



### Initial sniff of your data

Now that the data has been read into Python with the correct types, we can 

- Get the shape of the data
- Get some summary statistics 
- Convert, transform or create columns for a further check

#### - Get the shape of the data

We want to know the number of rows (records) and the number of columns (variables/features) of the dataset. We also want the names of the columns. `info()` already shows them, but we can ask for them explicitly, which is useful because we sometimes want to operate on the column names.

* Let's start with the cars data:


In [83]:
df_cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   MPG     392 non-null    float64
 1   CYL     392 non-null    int64  
 2   ENG     392 non-null    float64
 3   WGT     392 non-null    int64  
dtypes: float64(2), int64(2)
memory usage: 12.4 KB


In [84]:
df_cars.shape


(392, 4)

In [85]:
len(df_cars) # number of rows

392

In [86]:
df_cars.columns, type(df_cars.columns) # gets a pandas `Index` object

(Index(['MPG', 'CYL', 'ENG', 'WGT'], dtype='object'),
 pandas.core.indexes.base.Index)

In [87]:
df_cars.columns.values, type(df_cars.columns.values) # gets a numpy version


(array(['MPG', 'CYL', 'ENG', 'WGT'], dtype=object), numpy.ndarray)

In [88]:
list(df_cars.columns) # get a simple Python list of column names


['MPG', 'CYL', 'ENG', 'WGT']

#### - Get some summary stats

- Summary stats of **numeric columns** can be obtained easily using `df.describe()`. If you observe extreme min and max values, you can also calculate 
```
df['col'].quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])
```
to see where the bulk of the data lies.

- Summary stats of **categorical columns** are normally summary tables with frequencies and proportions.  

- Summary stats for messier columns of string or object types are not that trivial. You need to do more feature engineering and graphical visualization instead. 
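
Even for messy text columns there is a quick first look before any feature engineering. As a rough sketch on made-up data (`df_demo` is hypothetical), calling `describe()` on a non-numeric column returns `count`, `unique`, `top` and `freq` instead of means and quantiles:

```python
import pandas as pd

# Hypothetical text columns
df_demo = pd.DataFrame({
    "Status": ["Cancelled", "Arrived", "Arrived", "Confirmed"],
    "PU_Address": ["A St", "B Ave", "A St", "C Rd"],
})

# count / unique / top / freq for a single text column
status_summary = df_demo["Status"].describe()
print(status_summary)

# Number of distinct values in every column
print(df_demo.nunique())
```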

In [89]:
df_cars.describe()

Unnamed: 0,MPG,CYL,ENG,WGT
count,392.0,392.0,392.0,392.0
mean,23.445918,5.471939,194.41199,2977.584184
std,7.805007,1.705783,104.644004,849.40256
min,9.0,3.0,68.0,1613.0
25%,17.0,4.0,105.0,2225.25
50%,22.75,4.0,151.0,2803.5
75%,29.0,8.0,275.75,3614.75
max,46.599998,8.0,455.0,5140.0


In [90]:
df_cars['MPG'].quantile([.0, 0.01, 0.25, .5, 0.75, 0.99, 1])


0.00     9.000000
0.01    11.000000
0.25    17.000000
0.50    22.750000
0.75    29.000000
0.99    43.454002
1.00    46.599998
Name: MPG, dtype: float64

<div class="alert alert-block alert-success">
I think we can get more information from "CYL" if we treat it as a categorical or ordinal feature instead of a numeric one. Why?
</div>

In [91]:
cyl = df_cars['CYL']   # Get a column
print(type(cyl))
print(cyl)

<class 'pandas.core.series.Series'>
0      8
1      8
2      8
3      8
4      8
      ..
387    4
388    4
389    4
390    4
391    4
Name: CYL, Length: 392, dtype: int64


In [92]:
#define frequency table
tab = pd.crosstab(index=df_cars['CYL'], columns='count')
print(tab)

col_0  count
CYL         
3          4
4        199
5          3
6         83
8        103


In [93]:
#find proportions 
tab/tab.sum()

#or tab=pd.crosstab(index=df_cars['CYL'], columns='percentage', normalize='columns')

col_0,count
CYL,Unnamed: 1_level_1
3,0.010204
4,0.507653
5,0.007653
6,0.211735
8,0.262755
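
An alternative sketch for the same tables uses `value_counts()`. To keep it self-contained, the `cyl` Series below is rebuilt from the counts in the frequency table above; it is a stand-in for `df_cars['CYL']`, not the real column.

```python
import pandas as pd

# Stand-in for df_cars['CYL'], rebuilt from the frequency table above
cyl = pd.Series([3] * 4 + [4] * 199 + [5] * 3 + [6] * 83 + [8] * 103, name="CYL")

# Frequencies and proportions, sorted by category rather than by count
freq = cyl.value_counts().sort_index()
prop = cyl.value_counts(normalize=True).sort_index()

# Put both side by side in one summary table
summary = pd.DataFrame({"count": freq, "proportion": prop})
print(summary)
```

`value_counts()` sorts by frequency by default, so `sort_index()` is what restores the natural 3, 4, 5, 6, 8 order.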


#### - Convert, transform or create columns for a further check

Let's say we are interested in the ratio of ENG to CYL, which indicates the engine size per cylinder.

In [94]:
df_cars['ratio_ENG_CYL']=df_cars['ENG']/df_cars['CYL']
df_cars.head()


Unnamed: 0,MPG,CYL,ENG,WGT,ratio_ENG_CYL
0,18.0,8,307.0,3504,38.375
1,15.0,8,350.0,3693,43.75
2,18.0,8,318.0,3436,39.75
3,16.0,8,304.0,3433,38.0
4,17.0,8,302.0,3449,37.75


<div class="alert alert-block alert-danger">
<b>Practice 6</b>
<ol>
<li>Display the column names of the df_titanic DataFrame.</li>
<li>Display the number of rows of the df_titanic DataFrame.</li>
<li>Apply <kbd>describe()</kbd> to the df_titanic DataFrame. What results do you observe? Quickly check the information it provides.</li>
<li>Display the frequency and proportion tables of "Pclass" from df_titanic.</li>
</ol> 