# Obtaining Data Types in a Matrix Column

### Import the Packages:

In [1]:
import pandas as pd
import numpy as np
import os 

### Load the Dataset

In [2]:
filename = os.path.join(os.getcwd(), "data", "censusData.csv")
df = pd.read_csv(filename, header=0)

### Inspect the Data 
Use the `head()` method to inspect DataFrame `df`.

In [3]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex_selfID,capital-gain,capital-loss,hours-per-week,native-country,income
0,36,State-gov,112074,Doctorate,16,Never-married,Prof-specialty,Not-in-family,White,Non-Female,0,0,45,United-States,<=50K
1,35,Private,32528,HS-grad,9,Married-civ-spouse,Handlers-cleaners,Husband,White,Non-Female,0,0,45,United-States,<=50K
2,21,Private,270043,Some-college,10,Never-married,Other-service,Own-child,White,Female,0,0,16,United-States,<=50K
3,45,Private,168837,Some-college,10,Married-civ-spouse,Adm-clerical,Wife,White,Female,0,0,24,Canada,>50K
4,39,Private,297449,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Non-Female,0,0,40,United-States,>50K


### Get summary statistics by column using Pandas `describe()` Method

A quick way to get an overview of the data and the key statistics for each column is the Pandas DataFrame `describe()` method. Run the cell below to get more information about `describe()`. You can also consult the online [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html).

In [4]:
df.describe?

The code cell below runs the `describe()` method on DataFrame `df`. 

In [5]:
df.describe()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
count,7000.0,7000.0,7000.0,7000.0,7000.0,7000.0
mean,38.596714,192433.5,10.049857,1079.000429,84.970286,40.107143
std,13.745594,106336.5,2.580982,7011.160679,400.142351,12.323946
min,17.0,18827.0,1.0,0.0,0.0,1.0
25%,28.0,120247.8,9.0,0.0,0.0,40.0
50%,37.0,182117.0,10.0,0.0,0.0,40.0
75%,47.0,240237.0,12.0,0.0,0.0,45.0
max,90.0,1268339.0,16.0,99999.0,4356.0,99.0


### Get the Data Types for All Columns Using the Pandas `dtypes` Property

Note that some columns are excluded from the summary statistics above. This is because, by default, the `describe()` method includes only numeric columns. You can inspect the data type of each column by using the `dtypes` property. Run the code cell below and inspect the results.

In [6]:
df.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex_selfID        object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
income            object
dtype: object
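The split between the columns that `describe()` summarizes by default and the ones it skips can also be obtained programmatically. The sketch below assumes a small, hypothetical DataFrame named `sample` that mimics a few census columns (it is not the real dataset) and uses `select_dtypes()` to separate numeric columns from the rest.

```python
import pandas as pd

# Hypothetical miniature of the census data: two numeric and two text columns.
sample = pd.DataFrame({
    "age": [36, 35, 21],
    "workclass": ["State-gov", "Private", "Private"],
    "hours-per-week": [45, 45, 16],
    "income": ["<=50K", "<=50K", ">50K"],
})

# Columns with a numeric dtype: these are the ones describe() summarizes by default.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Everything else (text columns here) is skipped by describe() unless asked for.
non_numeric_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```

The same two calls should work on `df`, since they rely only on the dtypes reported above.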

Let's take a closer look at the `dtypes` results.
Even if you are familiar with the data types in Python, the results may seem confusing. For example, what is an `object` type?
Not to worry: Pandas uses its own convention for referring to data types. Here is a simple table to help you map Pandas data types to other data types:

<table>
  <tr>
    <th>Pandas dtype</th>
    <th>Python type</th>
    <th>NumPy type</th>
    <th>Usage</th>
  </tr>
  <tr><td>object</td><td>str or mixed</td><td>string_, unicode_, mixed types</td><td>Text or mixed numeric and non-numeric values</td></tr>
  <tr><td>int64</td><td>int</td><td>int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64</td><td>Integer numbers</td></tr>
  <tr><td>float64</td><td>float</td><td>float_, float16, float32, float64</td><td>Floating point numbers</td></tr>
  <tr><td>bool</td><td>bool</td><td>bool_</td><td>True/False values</td></tr>
  <tr><td>datetime64</td><td>NA</td><td>datetime64[ns]</td><td>Date and time values</td></tr>
  <tr><td>timedelta[ns]</td><td>NA</td><td>NA</td><td>Differences between two datetimes</td></tr>
  <tr><td>category</td><td>NA</td><td>NA</td><td>Finite list of text values</td></tr>
</table>
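To make the table concrete, here is a minimal sketch that builds a small, hypothetical DataFrame named `demo` (all column names are invented for illustration) with one column per kind of value and prints the dtype Pandas assigns to each. Text columns appear as `object` in the output shown earlier; newer Pandas releases may report a dedicated string dtype instead, so the text column's dtype is printed but not assumed.

```python
import pandas as pd

# Hypothetical data: one column for each kind of value listed in the table.
demo = pd.DataFrame({
    "name": ["a", "b", "c"],                       # text
    "n_items": [1, 2, 3],                          # integers
    "ratio": [0.5, 1.5, 2.5],                      # floating point numbers
    "flag": [True, False, True],                   # True/False values
    "when": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    "grade": pd.Categorical(["low", "high", "low"]),  # finite set of text values
})

# Subtracting datetimes yields a timedelta column.
demo["elapsed"] = demo["when"] - demo["when"].iloc[0]

print(demo.dtypes)
```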



In the cell below, call `df.describe()` with the parameter `include='all'`. This will produce summary statistics for all columns in DataFrame `df`. Examine the results. The `describe()` method now gives a quick and easy way to assess balance with regard to the label, sex, race, and other columns containing string values.
In particular, observe the values in `count`, `unique`, `top`, and `freq` for the label column, `income`: the most frequent class, `<=50K`, accounts for 5319 of the 7000 examples (about 76%). The classes are uneven, but our dataset does not appear to have a stark imbalance of one of the label classes.

In [7]:
df.describe(include='all')

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex_selfID,capital-gain,capital-loss,hours-per-week,native-country,income
count,7000.0,6625,7000.0,7000,7000.0,7000,6625,7000,7000,7000,7000.0,7000.0,7000.0,6862,7000
unique,,7,,16,,7,14,6,5,2,,,,40,2
top,,Private,,HS-grad,,Married-civ-spouse,Prof-specialty,Husband,White,Non-Female,,,,United-States,<=50K
freq,,4879,,2263,,3277,911,2878,5990,4731,,,,6233,5319
mean,38.596714,,192433.5,,10.049857,,,,,,1079.000429,84.970286,40.107143,,
std,13.745594,,106336.5,,2.580982,,,,,,7011.160679,400.142351,12.323946,,
min,17.0,,18827.0,,1.0,,,,,,0.0,0.0,1.0,,
25%,28.0,,120247.8,,9.0,,,,,,0.0,0.0,40.0,,
50%,37.0,,182117.0,,10.0,,,,,,0.0,0.0,40.0,,
75%,47.0,,240237.0,,12.0,,,,,,0.0,0.0,45.0,,
max,90.0,,1268339.0,,16.0,,,,,,99999.0,4356.0,99.0,,
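The `top` and `freq` rows report only the most frequent value in each column. To see the full class balance of a label column, `value_counts()` is a common complement. The sketch below uses a short, hypothetical Series standing in for `df['income']`; the values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical label column standing in for df["income"].
income = pd.Series(["<=50K", "<=50K", "<=50K", ">50K"], name="income")

# Absolute number of examples per class.
counts = income.value_counts()

# Fraction of examples per class (the fractions sum to 1).
shares = income.value_counts(normalize=True)

# describe() on a text column reports count, unique, top, and freq.
summary = income.describe()

print(counts)
print(shares)
print(summary)
```

On the real data, replacing `income` with `df['income']` should give the per-class counts behind the `freq` value shown above.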


### A More Detailed Way to Read Column Types using `pd.api.types.infer_dtype()`

The code cell below creates a dictionary in which each key corresponds to a column name and each value corresponds to its data type. It uses the function `pd.api.types.infer_dtype()` to find the data type of each column. Run the cell below and inspect the results.

In [8]:
types_dict = {}
for column in df.columns:
    types_dict[column] = pd.api.types.infer_dtype(df[column])

types_dict

{'age': 'integer',
 'capital-gain': 'integer',
 'capital-loss': 'integer',
 'education': 'string',
 'education-num': 'integer',
 'fnlwgt': 'integer',
 'hours-per-week': 'integer',
 'income': 'string',
 'marital-status': 'string',
 'native-country': 'string',
 'occupation': 'string',
 'race': 'string',
 'relationship': 'string',
 'sex_selfID': 'string',
 'workclass': 'string'}
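To illustrate how `pd.api.types.infer_dtype()` differs from `dtypes`, the sketch below applies it to a few hypothetical Series (the names and values are invented for illustration). It inspects the values themselves, so it can tell apart `object` columns that hold only text from those that mix text with numbers. The `skipna=True` argument is passed explicitly so that missing values are ignored during inference.

```python
import numpy as np
import pandas as pd

# Hypothetical columns covering a few common situations.
examples = {
    "integers": pd.Series([1, 2, 3]),
    "floats": pd.Series([1.0, 2.5]),
    "text": pd.Series(["a", "b"], dtype=object),
    "text_with_missing": pd.Series(["a", np.nan], dtype=object),
    "text_and_int": pd.Series(["a", 1], dtype=object),
}

# Same idea as types_dict above, written as a dictionary comprehension.
inferred = {
    name: pd.api.types.infer_dtype(series, skipna=True)
    for name, series in examples.items()
}

for name, kind in inferred.items():
    print(f"{name}: {kind}")
```

The last three Series all share the `object` dtype, yet only the last is reported as a mixed type, which is the extra detail `infer_dtype()` adds over `dtypes`.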