# Cleaning Data in Python

# Exploring your data

## Diagnose data for cleaning
* Prepare data for analysis
* Data almost never comes in clean
* Diagnose your data for problems

### Common data problems
* Inconsistent column names
* Missing data
* Outliers
* Duplicate rows
* Untidy data
* Columns that need processing
* Column types can signal unexpected data values
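
As a rough illustration of how several of these problems can be detected, here is a minimal sketch on a made-up DataFrame. The frame `toy` and its values are hypothetical and are not taken from the dataset loaded below.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data exhibiting several of the problems listed above
toy = pd.DataFrame({
    "Country ": ["A", "B", "B", "C"],                  # trailing space in the column name
    "fertility": [1.8, 2.6, 2.6, np.nan],              # one missing value
    "population": ["1,000", "2,500", "2,500", "300"],  # numbers stored as text
})

# Column names with leading or trailing whitespace
padded_names = [name for name in toy.columns if name != name.strip()]

# Missing values per column
missing_counts = toy.isna().sum()

# Fully duplicated rows (the first occurrence is not flagged)
duplicate_count = int(toy.duplicated().sum())

# Column types: a text dtype on a numeric-looking column is a warning sign
column_types = toy.dtypes

print(padded_names)
print(missing_counts)
print(duplicate_count)
print(column_types)
```

Outliers and untidy layouts usually need plots or reshaping to diagnose, so they are left out of this sketch.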

In [3]:
# Load your data
import pandas as pd
df = pd.read_csv('literary_birth_rate.csv')

In [5]:
# Visually inspect the first and last rows
print(df.head())
print(df.tail())

    Country  Continent female literacy fertility     population  Unnamed: 5  \
0      Chine       ASI            90.5     1.769  1,324,655,000         NaN   
1       Inde       ASI            50.8     2.682  1,139,964,932         NaN   
2        USA       NAM              99     2.077    304,060,000         NaN   
3  Indonésie       ASI            88.8     2.132    227,345,082         NaN   
4     Brésil       LAT            90.2     1.827    191,971,506         NaN   

   Unnamed: 6  
0         NaN  
1         NaN  
2         NaN  
3         NaN  
4         NaN  
                               Country  Continent female literacy fertility  \
177              Antilles néerlandaises       NaN            96.3       NaN   
178                       Iles Caïmanes       NaN              99       NaN   
179                          Seychelles       NaN            92.3       NaN   
180  Territoires autonomes palestiniens       NaN            90.9       NaN   
181                               

In [12]:
# Inspect the column names
df.columns

Index(['Country ', 'Continent', 'female literacy', 'fertility', 'population',
       'Unnamed: 5', 'Unnamed: 6'],
      dtype='object')
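
Note the trailing space in 'Country ' above, an example of an inconsistent column name. One possible fix, sketched on an empty stand-in frame (`df_demo` is a hypothetical name) that reuses the labels printed above, is to strip whitespace from every label:

```python
import pandas as pd

# Stand-in frame reusing the column labels printed above (no rows needed)
df_demo = pd.DataFrame(columns=['Country ', 'Continent', 'female literacy',
                                'fertility', 'population',
                                'Unnamed: 5', 'Unnamed: 6'])

# Remove leading/trailing whitespace from every column label
df_demo.columns = df_demo.columns.str.strip()

print(df_demo.columns.tolist())
```

The same assignment should work on the real df, assuming all of its labels are strings.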

In [9]:
# Check the number of rows and columns
df.shape

(182, 7)

In [11]:
# Summarize column dtypes and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 182 entries, 0 to 181
Data columns (total 7 columns):
Country            169 non-null object
Continent          164 non-null object
female literacy    169 non-null object
fertility          163 non-null object
population         162 non-null object
Unnamed: 5         0 non-null float64
Unnamed: 6         0 non-null float64
dtypes: float64(2), object(5)
memory usage: 6.4+ KB
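
The summary above shows two columns with no data at all and several numeric-looking columns stored as object. The output does not say why each column was read as text; the thousands separators visible in population are one likely cause. A minimal sketch of one way to handle both issues, using a small stand-in frame (`sample`, a hypothetical name) built from the first three rows printed earlier:

```python
import numpy as np
import pandas as pd

# Small stand-in built from the first three rows printed by df.head() above
sample = pd.DataFrame({
    'Country ': ['Chine', 'Inde', 'USA'],
    'female literacy': ['90.5', '50.8', '99'],
    'population': ['1,324,655,000', '1,139,964,932', '304,060,000'],
    'Unnamed: 5': [np.nan, np.nan, np.nan],
})

# Drop columns that contain no data at all
sample = sample.dropna(axis=1, how='all')

# Strip thousands separators, then convert text to numbers;
# errors='coerce' turns anything unparseable into NaN instead of raising
sample['population'] = pd.to_numeric(
    sample['population'].str.replace(',', '', regex=False), errors='coerce')
sample['female literacy'] = pd.to_numeric(sample['female literacy'],
                                          errors='coerce')

print(sample.dtypes)
```

Coercing silently hides bad values as NaN, so it is worth comparing non-null counts before and after the conversion.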


### Loading and viewing your data
In this chapter, you're going to look at a subset of the Department of Buildings Job Application Filings dataset from the __[NYC Open Data](https://opendata.cityofnewyork.us/)__ portal. This dataset consists of job applications filed on January 22, 2017.

Your first task is to load this dataset into a DataFrame and then inspect it using the .head() and .tail() methods. However, you'll find out very quickly that the printed results don't allow you to see everything you need, since there are too many columns. Therefore, you need to look at the data in another way.
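
When a DataFrame has too many columns to print in full, two common workarounds are to lift the pandas display limit temporarily or to transpose the rows. The sketch below uses a hypothetical 30-column frame (`wide`) rather than the actual dataset:

```python
import pandas as pd

# Hypothetical wide frame: one row, 30 columns named col_0 ... col_29
wide = pd.DataFrame([list(range(30))],
                    columns=[f'col_{i}' for i in range(30)])

# Option 1: temporarily lift the column limit for a single print
with pd.option_context('display.max_columns', None, 'display.width', 1000):
    print(wide.head())

# Option 2: transpose so that each column becomes a row
transposed = wide.head().T
print(transposed)
```

Using `pd.option_context` keeps the change local to the `with` block, so later output is unaffected.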

The .shape and .columns attributes let you see the shape of the DataFrame and obtain a list of its columns. From here, you can see which columns are relevant to the questions you'd like to ask of the data. To this end, a new DataFrame, df_subset, consisting only of these relevant columns, has been pre-loaded. This is the DataFrame you'll work with in the rest of the chapter.
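
As a sketch of this step, the stand-in frame below (`jobs`, a hypothetical name) copies a few columns and rows from the head of df printed later in this section. Which columns count as relevant is an assumption here; the real df_subset may contain a different selection.

```python
import pandas as pd

# Stand-in rows copied from the first columns of the printed head of df
jobs = pd.DataFrame({
    'Job #': [121577873, 520129502, 121601560],
    'Doc #': [2, 1, 1],
    'Borough': ['MANHATTAN', 'STATEN ISLAND', 'MANHATTAN'],
    'Job Type': ['A2', 'A3', 'A2'],
    'Job Status': ['D', 'A', 'Q'],
})

# .shape and .columns are attributes, so no parentheses
n_rows, n_cols = jobs.shape
print(n_rows, n_cols)
print(jobs.columns.tolist())

# Hypothetical choice of "relevant" columns; the real df_subset may differ
relevant = ['Job #', 'Borough', 'Job Type']
jobs_subset = jobs[relevant].copy()
print(jobs_subset.shape)
```

Calling `.copy()` gives an independent DataFrame, so later edits to the subset do not touch the original.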

Get acquainted with the dataset now by exploring it with pandas! This initial exploratory analysis is a crucial first step of data cleaning.

* Import pandas as pd.
* Read 'dob_job_application_filings_subset.csv' into a DataFrame called df.
* Print the head and tail of df.
* Print the shape of df and its columns. Note: .shape and .columns are attributes, not methods, so you don't need to follow these with parentheses ().
* Run the code to view the results. Notice the suspicious number of 0 values. Perhaps these represent missing data.
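
To check how widespread the 0 values are, one option is to count them per numeric column before deciding whether they stand for missing data. The sketch below uses a hypothetical frame (`filings`) with made-up column names and values, not the actual dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns; names and values are illustrative only
filings = pd.DataFrame({
    'existing_stories': [0, 5, 0, 12],
    'proposed_stories': [0, 5, 0, 12],
    'doc_number': [2, 1, 1, 1],
})

# Count exact zeros in each numeric column
numeric = filings.select_dtypes(include='number')
zero_counts = (numeric == 0).sum()
print(zero_counts)

# Share of zeros per column, to judge how suspicious they are
zero_share = (numeric == 0).mean()
print(zero_share)

# If zeros are judged to mean "missing", mark them as NaN
# in the chosen columns only
suspect = ['existing_stories', 'proposed_stories']
cleaned = filings.copy()
cleaned[suspect] = cleaned[suspect].replace(0, np.nan)
print(cleaned.isna().sum())
```

Whether a 0 is a genuine value or a placeholder depends on the column, so the replacement is restricted to columns chosen by hand.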

In [16]:
# Import pandas
import pandas as pd

# Read the file into a DataFrame: df
df = pd.read_csv('dob_job_application_filings_subset.csv')

# Print the head of df
print(df.head())

# Print the tail of df
print(df.tail())

# Print the shape of df
print(df.shape)

# Print the columns of df
print(df.columns)

# Read the subset file into df_subset, then print its columns, head, and tail
df_subset = pd.read_csv('df_dob_job_application_filings_subset2.csv')
print(df_subset.columns)
print(df_subset.head())
print(df_subset.tail())

       Job #  Doc #        Borough       House #  \
0  121577873      2      MANHATTAN  386            
1  520129502      1  STATEN ISLAND  107            
2  121601560      1      MANHATTAN  63             
3  121601203      1      MANHATTAN  48             
4  121601338      1      MANHATTAN  45             

                        Street Name  Block  Lot    Bin # Job Type Job Status  \
0  PARK AVENUE SOUTH                   857   38  1016890       A2          D   
1  KNOX PLACE                          342    1  5161350       A3          A   
2  WEST 131 STREET                    1729    9  1053831       A2          Q   
3  WEST 25TH STREET                    826   69  1015610       A2          D   
4  WEST 29 STREET                      831    7  1015754       A3          D   

            ...                         Owner's Last Name  \
0           ...            MIGLIORE                         
1           ...            BLUMENBERG                       
2           ...        