# Data Cleaning
## Become familiar with: 
* Handling common data-import scenarios: different file types (.csv, .txt), different delimiters (comma, tab, pipe), and different methods (read_csv, read_table)
* Getting basic information, such as dimensions, column names, and statistics summary
* Performing basic data cleaning: removing NAs and blank spaces, imputing missing values, changing a variable's type, and so on
* Creating dummy variables in various scenarios to aid modelling
* Generating simple plots like scatter plots, bar charts, histograms, box plots, and so on

# Outline

## Reading the data – variations and examples

### Data frames
A data frame is the most commonly used pandas object and one of the most common
data structures in Python data analysis. It is very similar to a table in a
spreadsheet or a SQL database. Structurally, it can be thought of as a dictionary
of Series objects that share an index. Like a spreadsheet, a data frame has index
labels (which identify rows) and column labels (which identify columns). It is a
2D structure whose columns can have the same or different types. Most standard
operations that can be applied to a spreadsheet or a SQL table, such as
aggregation, filtering, and pivoting, can be applied to data frames using
methods in pandas.
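
The "dictionary of Series" view can be sketched as follows. The column names and values are made up for illustration and are not part of any dataset used in this chapter.

```python
import pandas as pd

# Hypothetical columns, invented for this sketch.
names = pd.Series(["Ann", "Bob", "Cid"])
ages = pd.Series([31, 45, 27])

# A dict of Series becomes a data frame: the keys are the column labels,
# and the shared Series index supplies the row (index) labels.
df = pd.DataFrame({"name": names, "age": ages})

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column labels
print(df.dtypes)         # each column keeps its own type

# Spreadsheet-style operations: filter rows, then aggregate a column.
over_30 = df[df["age"] > 30]
mean_age = df["age"].mean()
```

Selecting `df["age"]` gives back a Series, which is why the dictionary analogy is a convenient way to think about the structure.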

### Delimiters 
A delimiter is a special character that separates the columns of a dataset from
one another. The most common delimiter, and the default for many tools, is the
comma (,). A .csv file is so named because it holds comma-separated values.
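
As a small illustration of what a delimiter does, the sketch below parses the same made-up record written with a comma and with a pipe. It uses Python's standard `csv` module rather than pandas, purely to keep the example minimal; the sample text is hypothetical.

```python
import csv
import io

# The same hypothetical record, written with two different delimiters.
comma_text = "name,age\nAnn,31\n"
pipe_text = "name|age\nAnn|31\n"

# io.StringIO wraps a string so it can be read like a file.
rows_comma = list(csv.reader(io.StringIO(comma_text)))  # comma is the default
rows_pipe = list(csv.reader(io.StringIO(pipe_text), delimiter="|"))

# Once the right delimiter is given, both texts yield the same columns.
print(rows_comma == rows_pipe)
```

Reading the pipe-delimited text with the default comma would instead return each line as a single column, which is the usual symptom of a wrong delimiter.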

## Various methods of importing data in Python

### Case 1 – reading a dataset using the read_csv method

#### The read_csv method
The name of the method doesn't unveil its full might. It is a kind of misnomer in
the sense that it makes us think that it can be used to read only CSV files, which is
not the case. Various kinds of files, including .txt files having delimiters of various
kinds can be read using this method.
```
pd.read_csv(filepath, sep=',', dtype=None, header='infer',
            skiprows=None, index_col=None, skip_blank_lines=True, na_filter=True)
```
* `filepath`: filepath is the complete address of the dataset or file that you
are trying to read. The complete address includes the address of the directory
in which the file is stored and the full name of the file with its extension.
Remember to use a forward slash (/) in the directory address. Later in this
chapter, we will see that the filepath can be a URL as well.
* `sep:` sep allows us to specify the delimiter of the dataset being read. By
default, the method assumes that the delimiter is a comma (,). Other commonly
used delimiters are blank space ( ), tab (\t), and pipe (|); such datasets are
called space-delimited, tab-delimited, or pipe-delimited. This argument also
accepts a regular expression as its value.
* `dtype:` Sometimes certain columns of the dataset need to be converted to
another type before certain operations can be applied successfully. One
example is date variables: very often they are read as strings and need to be
converted to a date type before date-related operations can be applied. The
dtype argument specifies the data type of the columns of the dataset. Suppose
two columns, a and b, need to be read as int32 and float64 respectively; this
can be achieved by passing `{'a': np.int32, 'b': np.float64}` as the value of
dtype. If dtype is not specified, pandas infers a type for each column from
the data. (Date columns are usually handled with the separate parse_dates
argument of read_csv rather than dtype.)
* `header:` The value of the header argument can be an integer or a list.
Most of the time, datasets have a header row containing the column names.
The header argument specifies which row is to be used as the header.
By default, the first row is the header, which can be written as header=0.
Not specifying the header argument is equivalent to specifying header=0.
If one specifies header=None, the method reads the data without treating
any row as a header containing the column names.
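
The arguments described above can be tried out on a small in-memory dataset. This is a minimal sketch: the data, the column names a, b, and when, and the use of `io.StringIO` in place of a file path are assumptions made for illustration, and `parse_dates` is an additional read_csv argument that is not covered in the list above.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical pipe-delimited data with a header row.
raw = "a|b|when\n1|2.5|2020-01-31\n3|4.0|2020-02-29\n"

# sep picks the delimiter, header=0 uses the first row as column names,
# dtype fixes the types of a and b, and parse_dates converts the date column.
df = pd.read_csv(
    io.StringIO(raw),
    sep="|",
    header=0,
    dtype={"a": np.int32, "b": np.float64},
    parse_dates=["when"],
)
print(df.dtypes)

# header=None: no row is used for names, so the columns are numbered
# 0, 1, 2 and the original header line comes back as the first data row.
no_header = pd.read_csv(io.StringIO(raw), sep="|", header=None)
print(no_header.shape)

# sep also accepts a regular expression, here "one or more whitespace characters".
spaced = pd.read_csv(io.StringIO("a  b\n1   2\n"), sep=r"\s+")
print(list(spaced.columns))
```

Comparing `df` with `no_header` shows why the header argument matters: with header=None the column names end up inside the data, and every column is read as strings.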

#### Use cases of the read_csv method

#### Passing the directory address and filename as variables

#### Reading a .txt dataset with a comma delimiter

#### Specifying the column names of a dataset from a list

### Case 2 – reading a dataset using the open method of Python

#### Reading a dataset line by line

#### Changing the delimiter of a dataset

### Case 3 – reading data from a URL 

### Case 4 – miscellaneous cases

#### Reading from an .xls or .xlsx file

#### Writing to a CSV or Excel file

### Basics – summary, dimensions, and structure

### Handling missing values

### Checking for missing values

### How missing values are generated and propagated

### Treating missing values

#### Deletion

#### Imputation

### Creating dummy variables

### Visualizing a dataset by basic plotting

#### Scatter plots

#### Histograms

#### Boxplots

## Summary

```
import pandas as pd
data = pd.read_csv('./Chapter2/titanic3.csv')
```