# Exploratory Data Analysis (EDA) and Cleaning

- EDA - Exploratory data analysis (ensure better information understanding, availability and accuracy)
    - Variables Identification
        - Naming convention consistency application
    - Univariate Analysis
    - Bi-variate Analysis
    - Missing Values Handling
    - Outliers
- Data preprocessing, feature engineering (make data ready for ML, have a remarkable impact on the power of prediction)
    - Variables transformation
        - normalization
        - standardization
    - Variable creation
        - character encoding
    - Working with dates
    - Inconsistent data entry
- Re-assessment and iteration


## Objective



## Dataset description

## Load Data

In [6]:
import pandas as pd
import numpy as np

seed = 1234

In [2]:
# Load a dataset
# https://www.kaggle.com/c/titanic/data
df_titanic = pd.read_csv("data/titanic_train.csv")

In [26]:
df_titanic.shape

(891, 12)

There are 891 `observations` / `cases` in the dataset.

In [3]:
df_titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
df_titanic.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [8]:
df_titanic.sample(5, random_state=seed)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
523,524,1,1,"Hippach, Mrs. Louis Albert (Ida Sophia Fischer)",female,44.0,0,1,111361,57.9792,B18,C
778,779,0,3,"Kilgannon, Mr. Thomas J",male,,0,0,36865,7.7375,,Q
760,761,0,3,"Garfirth, Mr. John",male,,0,0,358585,14.5,,S
496,497,1,1,"Eustis, Miss. Elizabeth Mussey",female,54.0,1,0,36947,78.2667,D20,C
583,584,0,1,"Ross, Mr. John Hugo",male,36.0,0,0,13049,40.125,A10,C


## Variables Identification

### Variables types
- **input** (predictors, independent features / variables / attributes)
- **output** (target, dependent feature / variable / attribute, ground truth)

### Variables categories
- Continuous / Discrete
- Categorical
    - Nominal
    - Ordinal
    
qualitative and quantitative

### Data types
- Numeric
- Textual

## TODO
- list the variables from the dataset - see the description on Kaggle
- list and classify them
- compare the dataset with modifications (prediction) to the dataset without modifications (missing values and outliers removed) and to the dataset without any changes

## Univariate Analysis
Explore variables individually.

### Continuous Variables
- Understand the central tendency and spread (dispersion).
    - Central tendency
        - Mean
        - Median
        - Mode
    - Measures of dispersion
        - Min
        - Max
        - Range
        - Quartiles
        - IQR
        - Variance
        - Standard deviation
        - Skewness and Kurtosis
- Identify missing values and outliers
- Visualization methods
    - Histogram
    - BoxPlot
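
The measures listed above can be computed directly with pandas. This is a minimal sketch on a small, made-up series of fares (the values are illustrative and not taken from the Titanic data); with the real dataset a column such as `df_titanic["Fare"]` would take its place.

```python
import pandas as pd

# Hypothetical sample of a continuous variable (illustrative values only)
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 53.1, 71.28, 263.0], name="Fare")

# Central tendency
mean = fares.mean()
median = fares.median()

# Dispersion
value_range = fares.max() - fares.min()
q1, q3 = fares.quantile(0.25), fares.quantile(0.75)
iqr = q3 - q1
variance = fares.var()      # sample variance (ddof=1)
std_dev = fares.std()       # sample standard deviation (ddof=1)

# Shape of the distribution
skewness = fares.skew()
kurtosis = fares.kurt()     # excess kurtosis

# describe() reports count, mean, std, min, quartiles and max in one call
summary = fares.describe()
print(summary)
print(f"IQR={iqr:.2f}, skewness={skewness:.2f}, kurtosis={kurtosis:.2f}")
```

A positive skewness, as in this sample, is a hint that a histogram or box plot will show a long right tail.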
    
### Categorical Variables
- Understand distribution of each category, get a sense of distribution of records across the categories
    - One-way frequency or relative frequency table (count and percentage of values under a category).
- Visualization methods
    - Bar chart

In [12]:
# One-way frequency table
pd.crosstab(index=df_titanic["Survived"], columns="count")

col_0,count
Survived,Unnamed: 1_level_1
0,549
1,342


In [11]:
# One-way relative frequency table
pd.crosstab(index=df_titanic["Survived"], columns="frequency", normalize=True)

col_0,frequency
Survived,Unnamed: 1_level_1
0,0.616162
1,0.383838


In [13]:
# Bar chart
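
One possible way to fill in the bar chart cell, sketched with the counts from the one-way frequency table above typed in by hand so the block runs on its own. With the loaded data, `df_titanic["Survived"].value_counts()` would give the same series. The title and axis label are arbitrary choices.

```python
import pandas as pd

# Counts copied from the one-way frequency table above
survived_counts = pd.Series({0: 549, 1: 342}, name="count")
survived_counts.index.name = "Survived"

# pandas delegates to matplotlib and returns the Axes object
ax = survived_counts.plot(kind="bar", rot=0, title="Passengers by survival")
ax.set_ylabel("Number of passengers")
```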

## Bi-variate Analysis
Find relationships between any two variables, continuous and categorical.

### Continuous and Continuous
Identify any linear or non-linear relationship (pattern) between two variables.

- Visualization methods
    - Scatter plot

<img src="images/correlations.png" alt="Correlations" style="width: 600px;"/>

- Pearson and Spearman correlation
- Correlation matrix
- Correlation heatmap

Metrics
- Co-variance
- Variance
- Correlation coefficient
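
A small sketch of these metrics in pandas, on hypothetical data where `y` grows monotonically but not linearly with `x`. It is meant to illustrate the difference between Pearson (linear association) and Spearman (rank-based, monotonic association); the column names are made up.

```python
import pandas as pd

# Hypothetical data: y grows monotonically with x, but not linearly
data = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
data["y"] = data["x"] ** 3

covariance = data["x"].cov(data["y"])
pearson = data.corr(method="pearson").loc["x", "y"]
spearman = data.corr(method="spearman").loc["x", "y"]

print(f"covariance={covariance:.2f}, pearson={pearson:.3f}, spearman={spearman:.3f}")

# Full correlation matrix; this is what a correlation heatmap visualizes
corr_matrix = data.corr()
```

Spearman is 1 here because the ranks of `x` and `y` agree perfectly, while Pearson stays below 1 because the relationship is curved.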

### Categorical and Categorical
#### Methods
Two-way `frequency or relative frequency tables` also known as `crosstabs` or `contingency` tables. A frequency table is just a data table that shows the counts of one or more categorical variables. Relative frequency tables show what percent of data points fit in each category.

In the example below ([source](https://www.khanacademy.org/math/statistics-probability/analyzing-categorical-data/two-way-tables-for-categorical-data/a/two-way-tables-review)) there are two variables - gender and preference - this is where the "two" in two-way frequency table comes from. Each cell tells us the number (or frequency) of respondents in that combination of categories.

<img src="images/two-way-frequency-table.png" alt="Correlations" style="width: 300px;"/>

Two-way relative frequency tables show what percent of data points fit in each category. We can use row relative frequencies or column relative frequencies, it just depends on the context of the problem.

<img src="images/two-way-relative-frequency-table.png" alt="Correlations" style="width: 600px;"/>

Sometimes the percentages won't add up to exactly 100% even though each value was rounded properly. This is called `round-off error`, and we don't worry about it too much.

Two-way relative frequency tables are useful when there are different sample sizes in a dataset. In this example, more females were surveyed than males, so using percentages makes it easier to compare the preferences of males and females. From the relative frequencies, we can see that a large majority of males preferred dogs (78%) compared to a minority of females (41%).
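
A short sketch of column versus row relative frequencies, using the survival-by-sex counts from the Titanic crosstab in this notebook. The table is typed in by hand so the block is self-contained.

```python
import pandas as pd

counts = pd.DataFrame(
    {"female": [81, 233], "male": [468, 109]},
    index=["died", "survived"],
)

# Column relative frequencies: each column sums to 1
col_rel = counts / counts.sum(axis=0)

# Row relative frequencies: each row sums to 1
row_rel = counts.div(counts.sum(axis=1), axis=0)

print(col_rel.round(3))
print(row_rel.round(3))
```

When working from the raw data, `pd.crosstab` can produce the same tables through its `normalize="columns"` and `normalize="index"` options.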
    
- Visualization methods
    - Stacked column chart




In [18]:
# Two-way table
# Table of survival vs. sex
survived_sex = pd.crosstab(index=df_titanic["Survived"], columns=df_titanic["Sex"])
survived_sex.index= ["died", "survived"]
survived_sex

Sex,female,male
died,81,468
survived,233,109


In [20]:
# Table of survival vs passenger class
survived_class = pd.crosstab(index=df_titanic["Survived"], 
                            columns=df_titanic["Pclass"])

survived_class.columns = ["class1","class2","class3"]
survived_class.index= ["died","survived"]

survived_class

Unnamed: 0,class1,class2,class3
died,80,97,372
survived,136,87,119


In [22]:
# get the marginal counts (totals for each row and column) by including the argument margins=True
# Table of survival vs passenger class
survived_class = pd.crosstab(index=df_titanic["Survived"], 
                            columns=df_titanic["Pclass"],
                             margins=True)   # Include row and column totals

survived_class.columns = ["class1","class2","class3","rowtotal"]
survived_class.index= ["died","survived","coltotal"]

survived_class

Unnamed: 0,class1,class2,class3,rowtotal
died,80,97,372,549
survived,136,87,119,342
coltotal,216,184,491,891


In [23]:
survived_class = pd.crosstab(index=df_titanic["Survived"], 
                            columns=df_titanic["Pclass"],
                             margins=True, normalize=True)   # Include row and column totals

survived_class.columns = ["class1","class2","class3","rowtotal"]
survived_class.index= ["died","survived","coltotal"]

survived_class

Unnamed: 0,class1,class2,class3,rowtotal
died,0.089787,0.108866,0.417508,0.616162
survived,0.152637,0.097643,0.133558,0.383838
coltotal,0.242424,0.20651,0.551066,1.0


`Stacked column chart`

In [24]:
# Stacked column chart
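
A possible version of the stacked column chart, sketched with the survival vs. passenger class counts from the table above typed in by hand. The axis labels are arbitrary choices.

```python
import pandas as pd

# Counts copied from the survival vs. passenger class table above
survived_class_counts = pd.DataFrame(
    {"class1": [80, 136], "class2": [97, 87], "class3": [372, 119]},
    index=["died", "survived"],
)

# Transpose so that each passenger class becomes one column (bar),
# stacked by survival outcome
ax = survived_class_counts.T.plot(kind="bar", stacked=True, rot=0)
ax.set_xlabel("Passenger class")
ax.set_ylabel("Number of passengers")
```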

The `Chi-Square Test` is used to assess the statistical significance of the relationship between two categorical variables. It tests whether the evidence in the sample is strong enough to generalize the relationship to a larger population. It returns a p-value: the probability, computed from the chi-square distribution with the appropriate degrees of freedom, of seeing a statistic at least as large as the observed one if the variables were independent.
- Probability close to 0: strong evidence that the two categorical variables are dependent
- Probability close to 1: the data are consistent with independence
- Probability less than 0.05: the relationship between the variables is significant at the 95% confidence level

In [None]:
# Chi-Square Test
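
One way this cell could be filled in, assuming SciPy is available. The sketch runs `scipy.stats.chi2_contingency` on the survival vs. sex counts from the two-way table above (typed in by hand) and also derives Cramer's V from the uncorrected statistic.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts copied from the survival vs. sex table above
# rows: died, survived; columns: female, male
observed = np.array([[81, 468],
                     [233, 109]])

# For 2x2 tables scipy applies Yates' continuity correction by default
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.1f}, p-value={p_value:.3g}, degrees of freedom={dof}")

# Cramer's V from the uncorrected statistic: sqrt(chi2 / (n * (min(r, c) - 1)))
chi2_raw = chi2_contingency(observed, correction=False)[0]
n = observed.sum()
cramers_v = np.sqrt(chi2_raw / (n * (min(observed.shape) - 1)))
print(f"Cramer's V={cramers_v:.2f}")
```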

Other statistical measures used to analyze the strength of the relationship are:
- Cramer's V for nominal categorical variables
- Mantel-Haenszel Chi-Square for ordinal categorical variables

### Categorical and Continuous
To explore the relationship between a categorical and a continuous variable, we can draw box plots for each level of the categorical variable. With only a few levels, the plots alone will not show whether the differences are statistically significant; the tests below are used for that.
#### Methods
- Z-test - tests if means of two groups are statistically different from each other 
- T-test - like the Z-test but for categories with fewer than 30 samples each
- ANOVA - assesses if the averages of more than two groups are statistically different
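
A minimal sketch of the t-test and ANOVA with SciPy, on randomly generated groups. The group means, spreads and sizes are made-up assumptions, chosen so that one group clearly differs from the other two.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1234)

# Hypothetical continuous measurements for three levels of a categorical variable
group_a = rng.normal(loc=30, scale=5, size=40)
group_b = rng.normal(loc=38, scale=5, size=40)
group_c = rng.normal(loc=30, scale=5, size=40)

# Two groups: independent two-sample t-test
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# More than two groups: one-way ANOVA
f_stat, anova_p = stats.f_oneway(group_a, group_b, group_c)

print(f"t-test p-value: {t_p:.3g}")
print(f"ANOVA p-value:  {anova_p:.3g}")
```

A significant ANOVA result only says that at least one group mean differs; it does not say which one.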

## Missing Values Handling
Missing data in the training dataset can reduce the power / fit of a model or can lead to a biased model, as the data do not represent the relationships between variables correctly. Most libraries (including scikit-learn) will give you an error if you try to build a model using data with missing values.

The most common reasons for missing data are related to:
- **Data collection**. Difficult and usually time consuming to correct.
- **Data extraction**. Usually easy to find and correct; mechanisms like hashing may help in ensuring that the data are extracted correctly.

<img src="images/missing-values-handling.png" alt="Correlations" style="width: 600px;"/>

[source](https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4)

The missing values treatment can be as follows:
- **Listwise or pairwise deletion**. It is important to understand that in the vast majority of cases, an important assumption to using either of these techniques is that your data is missing completely at random (MCAR). 

    In `listwise deletion` we delete observations (or columns) where any of the variables is missing. It is a simple method, but it reduces the power of the model by reducing the sample size. 
    
    `Pairwise deletion` occurs when the statistical procedure uses cases that contain some missing data. The procedure cannot include a particular variable when it has a missing value, but it can still use the case when analyzing other variables with non-missing values. Pairwise deletion allows you to use more of your data. However, each computed statistic may be based on a different subset of cases.

    The choice between pairwise and listwise deletion only matters when more than one variable is analyzed; with a single variable the two are equivalent. In other situations, missing values may be treated as a valid category of their own. 
    
- **Mean / mode / median imputation**. Imputation means filling the missing values with estimated ones; mean / mode / median imputation is among the most frequently used approaches. It consists of replacing the missing data for a given attribute quantitatively (mean, median) or qualitatively (mode). Mean imputation is one of the most ‘naive’ imputation methods because, unlike more complex methods like k-nearest neighbors imputation, it does not use the information we have about an observation to estimate a value for it. The imputation can take the form of:

    - `Generalized imputation`. We calculate mean, median or mode for all non missing values of a variable and replace all missing values of this variable with the result.
    - `Similar case imputation`. We calculate mean, median or mode for similar cases only (looking at similarity of other variables of other cases without missing data for the variable in question).
    

- **Prediction model**. We create a predictive model (linear / logistic regression, tree, etc.) to estimate values that will substitute the missing data. To do this, we divide our dataset into two parts: one with no missing values for the variable (training dataset with a target variable), another with missing values for the variable (test dataset to predict the target / missing variable). We populate the missing values with the predicted ones.

- **KNN Imputation**. The missing values of a variable are imputed using a given number (k) of observations that are most similar to the observation whose value is missing. The similarity is determined using a distance function.

    - Advantages: 
        - KNN can predict both qualitative and quantitative variables
        - Creation of a predictive model for each variable with missing data is not required
        - Variables with multiple missing values can be easily treated
        - Correlation of data is taken into consideration
    - Disadvantages:
        - KNN is time consuming
        - Choice of k-value is very critical 
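
The imputation variants described above can be sketched as follows. The data frame and the array are hypothetical (illustrative values only), and `KNNImputer` from scikit-learn is one possible implementation of KNN imputation, assumed to be installed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical passengers (illustrative values only)
people = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [20.0, 30.0, np.nan, 40.0, 50.0, np.nan],
})

# Generalized imputation: one median for the whole column
age_generalized = people["Age"].fillna(people["Age"].median())

# Similar case imputation: median of the rows that share the same Sex
age_by_sex = people["Age"].fillna(people.groupby("Sex")["Age"].transform("median"))

# KNN imputation: the missing value is the mean of the k nearest rows,
# where distance is measured on the features that are present
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0],
              [4.0, np.nan]])
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

print(age_generalized.tolist())
print(age_by_sex.tolist())
print(X_knn)
```

In the KNN part, the last row is closest to the rows with first feature 3 and 2, so its missing value becomes the mean of 6 and 4.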

In [None]:
# get the number of missing data points per column
# (shown on the Titanic data loaded above)
missing_values_count = df_titanic.isnull().sum()

# look at the # of missing points in the first ten columns
missing_values_count[0:10]

# how many total missing values do we have?
total_cells = np.prod(df_titanic.shape)
total_missing = missing_values_count.sum()

# percent of data that is missing
(total_missing/total_cells) * 100

# remove all the rows that contain a missing value (returns a new DataFrame)
df_titanic.dropna()

# remove all columns with at least one missing value
columns_with_na_dropped = df_titanic.dropna(axis=1)
columns_with_na_dropped.head()

# just how much data did we lose?
print("Columns in original dataset: %d \n" % df_titanic.shape[1])
print("Columns with na's dropped: %d" % columns_with_na_dropped.shape[1])

In [2]:
from sklearn.impute import SimpleImputer

# SimpleImputer's default strategy ("mean") only works on numeric columns,
# so the numeric part of the Titanic data is used as the example input
original_data = df_titanic.select_dtypes(include="number")

# simplest case: replace every missing value with its column mean
my_imputer = SimpleImputer()
data_with_imputed_values = my_imputer.fit_transform(original_data)

# make copy to avoid changing original data (when Imputing)
new_data = original_data.copy()

# make new columns indicating what will be imputed
cols_with_missing = [col for col in new_data.columns
                     if new_data[col].isnull().any()]
for col in cols_with_missing:
    new_data[col + '_was_missing'] = new_data[col].isnull()

# Imputation (keep the column names, including the new indicator columns)
my_imputer = SimpleImputer()
new_data = pd.DataFrame(my_imputer.fit_transform(new_data),
                        columns=new_data.columns, index=new_data.index)

# example on how to use imputer: https://www.kaggle.com/dansbecker/handling-missing-values/notebook

In [None]:
# Mean imputation
import pandas as pd
import numpy as np
# sklearn.preprocessing.Imputer was removed from scikit-learn; SimpleImputer replaces it
from sklearn.impute import SimpleImputer

# Create an empty dataset
df = pd.DataFrame()

# Create two variables called x0 and x1. Make the first value of x1 a missing value
df['x0'] = [0.3051,0.4949,0.6974,0.3769,0.2231,0.341,0.4436,0.5897,0.6308,0.5]
df['x1'] = [np.nan,0.2654,0.2615,0.5846,0.4615,0.8308,0.4962,0.3269,0.5346,0.6731]

# View the dataset
df

# Fit imputer
# Create an imputer object that looks for NaN values, then replaces them with the mean value of the feature (column)
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Train the imputer on the df dataset
mean_imputer = mean_imputer.fit(df)

# Apply the imputer to the df dataset (returns a NumPy array)
imputed_df = mean_imputer.transform(df)

# View the data
imputed_df
# Notice that 0.49273333 is the imputed value, replacing the np.NaN value.

## Outliers

An `outlier` is an observation that appears far away and diverges from an overall pattern in a sample. Outliers can be of two types: univariate and multivariate. `Univariate outliers` can be found while looking at distribution of a single variable data. `Multivariate outliers` are outstanding observations in an n-dimensional space.

<img src="images/outlier.png" alt="Correlations" style="width: 300px;"/>

<img src="images/n-outlier.png" alt="Correlations" style="width: 600px;"/>

[source](https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/)

The method of dealing with outliers depends on the reason for their occurrence. Causes of outliers can be classified in two broad categories: artificial (error) / non-natural and natural:

- artificial: data entry errors (human errors, experimental / sampling errors, intentional outlier), measurement errors (faulty instruments), data processing errors.
- natural: not caused by error.

Outliers can drastically change the results of the data analysis and statistical modeling. There are numerous unfavourable impacts of outliers in the data set:

- They increase the error variance and reduce the power of statistical tests.
- If the outliers are non-randomly distributed, they can decrease normality.
- They can bias or influence estimates that may be of substantive interest.
- They can also violate the basic assumptions of regression, ANOVA and other statistical models.

<img src="images/outliers-impact.png" alt="Correlations" style="width: 600px;"/>

The most commonly used method to detect outliers is visualization (box plot, histogram, scatter plot). Some analysts also use various rules of thumb to detect outliers:

- Any value below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR, where Q1 and Q3 are the first and third quartiles.
- Use capping methods. Any value outside the range of the 5th and 95th percentiles can be considered an outlier.
- Data points three or more standard deviations away from the mean are considered outliers.
- Outlier detection is merely a special case of the examination of data for influential data points and it also depends on the business understanding.
- Bivariate and multivariate outliers are typically measured using either an index of influence or leverage, or distance. Popular indices such as `Mahalanobis’ distance` and `Cook’s D` are frequently used to detect outliers.
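
A short sketch of the IQR rule and of percentile capping in pandas. The values are hypothetical, with one obviously extreme observation, and the fences use the usual Q1 - 1.5 x IQR and Q3 + 1.5 x IQR definition.

```python
import pandas as pd

# Hypothetical measurements with one obviously extreme value
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 95])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# IQR rule: flag everything outside the fences
outliers = values[(values < lower_fence) | (values > upper_fence)]

# Capping: pull extreme values back to the 5th / 95th percentiles
p05, p95 = values.quantile(0.05), values.quantile(0.95)
capped = values.clip(lower=p05, upper=p95)

print(f"fences: [{lower_fence}, {upper_fence}]")
print("outliers:", outliers.tolist())
print("max after capping:", capped.max())
```

In a sample this small the three-standard-deviations rule would not flag the value 95, because the extreme value itself inflates the standard deviation. This is one reason the IQR rule is often preferred for small samples.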

Most of the ways to deal with outliers are similar to the methods for missing values: deleting observations, transforming them, binning them, treating them as a separate group, imputing values and other statistical methods:

- **Deleting observations**: We delete outlier values if they are due to a data entry error or a data processing error, or if the outlier observations are very few in number. We can also use trimming at both ends to remove outliers.

- **Transforming and binning values**: Transforming variables can also eliminate outliers. Taking the `natural log of a value` reduces the variation caused by extreme values. `Binning` is also a form of variable transformation. Decision tree algorithms deal with outliers well because they bin the variable. We can also assign weights to different observations.

<img src="images/log-transformation.png" alt="Correlations" style="width: 600px;"/>

- **Imputing**: We can use the mean, median or mode to replace outlier values, but before that we should analyse whether the outlier is natural or artificial. If it is artificial, we can impute it. We can also use a statistical model to predict the values of outlier observations and then impute them with the predicted values.

- **Treat separately**: If there is a significant number of outliers, we should treat them separately in the statistical model. One approach is to treat the outliers and the remaining observations as two different groups, build an individual model for each group and then combine the output.

## Feature engineering

Feature engineering is the science (and art) of extracting more information from existing data. You are not adding any new data here, but you are actually making the data you already have more useful.

### Variables transformation

In data modelling, transformation refers to the replacement of a variable by a function. For instance, replacing a variable x by the square or cube root, or logarithm x is a transformation. In other words, transformation is a process that changes the distribution or relationship of a variable with others.

Reasons for variable transformation

- When we want to `change the scale` of a variable or `standardize the values` of a variable for better understanding. While this transformation is a must if you have data in different scales, it does not change the shape of the variable distribution.

    Most of the time, your dataset will contain features that vary widely in magnitude, units and range. Since many machine learning algorithms use the Euclidean distance between two data points in their computations, this is a problem: features with high magnitudes will weigh in a lot more in the distance calculations than features with low magnitudes. To suppress this effect, we need to bring all features to the same level of magnitude. This can be achieved by scaling.

    There are four common methods to perform Feature Scaling:
    
    - **Standardization**. Standardization replaces the values by their Z scores. 
    
    <img src="images/standardization.png" alt="Correlations" style="width: 200px;"/>

    This redistributes the features so that their mean is μ = 0 and their standard deviation is σ = 1. `sklearn.preprocessing.scale` implements standardization in Python.
    
    - **Mean Normalization**. This distribution will have values between -1 and 1 with μ=0.

    <img src="images/mean-normalization.png" alt="Correlations" style="width: 200px;"/>
    
    Standardization and Mean Normalization can be used for algorithms that assume zero-centered data, like `Principal Component Analysis (PCA)`.
    
    - **Min-Max Scaling**. This scaling brings the value between 0 and 1.
    
    <img src="images/min-max-scaling.png" alt="Correlations" style="width: 200px;"/>
    
    - **Unit Vector**. Scaling is done considering the whole feature vector to be of unit length.
    
    <img src="images/unit-vector.png" alt="Correlations" style="width: 200px;"/>

    `Min-Max Scaling` and `Unit Vector` techniques produce values in the range [0,1] (for unit vectors, as long as the features are non-negative). When dealing with features with hard boundaries this is quite useful. For example, when dealing with image data, the colors can range only from 0 to 255.
    
    Scale variables if using any algorithm that computes distance or assumes normality. For example:

    - `k-nearest neighbors` with a Euclidean distance measure is sensitive to magnitudes, so the features should be scaled for all of them to weigh in equally.
    - Scaling is critical when performing `Principal Component Analysis (PCA)`. PCA tries to find the directions with maximum variance, and the variance is high for high-magnitude features. This skews the PCA towards high-magnitude features.
    - We can speed up `gradient descent` by scaling. This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.
    
    But in the cases below, scaling may not help:
    
    - `Tree based models` are not distance-based models and can handle varying ranges of features. Hence, scaling is not required when modelling with trees.
    - Algorithms like `Linear Discriminant Analysis (LDA)` or `Naive Bayes` are by design equipped to handle this and give weights to the features accordingly. Performing feature scaling for these algorithms may not have much effect.

- When we want to `transform complex non-linear relationships into linear relationships`. A linear relationship between variables is easier to comprehend than a non-linear or curved one, and it can also improve the prediction. `Log transformation` is one of the most commonly used transformation techniques in these situations.

- When we want a symmetric distribution instead of a skewed one, as it is easier to interpret and to draw inferences from. Some modeling techniques require a `normal distribution` of variables. For a `right skewed distribution`, we take the square / cube root or logarithm of the variable, and for a `left skewed distribution`, we take the square / cube or exponential of the variable.

- When we want to group the values of a variable and represent them in groups, we use `binning`.
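
The four feature scaling methods described above can be sketched with NumPy as follows. The formulas are the standard textbook definitions, which is an assumption about what the formula images show, and the feature values are made up.

```python
import numpy as np

# Hypothetical feature values
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Standardization (z-scores): mean 0, standard deviation 1
standardized = (x - x.mean()) / x.std()

# Mean normalization: mean 0, values within [-1, 1]
mean_normalized = (x - x.mean()) / (x.max() - x.min())

# Min-max scaling: values within [0, 1]
min_max_scaled = (x - x.min()) / (x.max() - x.min())

# Unit vector: divide by the Euclidean length so the vector has norm 1
unit_vector = x / np.linalg.norm(x)

for name, arr in [("standardized", standardized),
                  ("mean normalized", mean_normalized),
                  ("min-max scaled", min_max_scaled),
                  ("unit vector", unit_vector)]:
    print(f"{name:16s}", np.round(arr, 3))
```

None of these changes the shape of the distribution; only the location and the range of the values move.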

There are various methods used to transform variables:

- **Logarithm**: Log of a variable is a common transformation method used to change the shape of the distribution of the variable. It is generally used for reducing right skewness of variables. However, it can’t be applied to zero or negative values.
- **Square / Cube root**: The square and cube root of a variable also affect its distribution, although not as strongly as the logarithmic transformation. The cube root has its own advantage: it can be applied to negative values and zero. The square root can only be applied to non-negative values.
- **Binning**: It is used to categorize variables. It is performed on original values, percentiles or frequencies. The choice of categorization technique is based on business understanding. For example, we can categorize income in three categories, namely: High, Average and Low. We can also perform `co-variate binning`, which depends on the value of more than one variable.
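
A minimal sketch of these transformations on a hypothetical, right-skewed income variable. The income values and the bin edges are arbitrary assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed incomes (illustrative values only)
income = pd.Series([12_000, 35_000, 58_000, 90_000, 250_000])

log_income = np.log(income)        # only valid for strictly positive values
sqrt_income = np.sqrt(income)      # valid for zero and positive values
cbrt_example = np.cbrt(-8.0)       # cube root also accepts negative values

# Binning on the original values; the cut points are arbitrary assumptions
income_group = pd.cut(
    income,
    bins=[0, 30_000, 80_000, np.inf],
    labels=["Low", "Average", "High"],
)

print(f"skewness before: {income.skew():.2f}, after log: {log_income.skew():.2f}")
print(income_group.tolist())
```

For data that contains zeros, `np.log1p` (the log of 1 + x) is a common workaround for the restriction on the plain logarithm.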

### Variables creation

Feature / variable creation is the process of generating new variables / features based on existing variable(s).

There are various techniques to create new features:

- **Creating derived variables**: This refers to creating new variables from existing variable(s) using a set of functions or different methods.
- **Creating dummy variables**: One of the most common applications of dummy variables is to convert a categorical variable into numerical variables (0-1 encoding, or `one-hot-encoding`). Dummy variables are also called `Indicator Variables`.
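
Both techniques can be sketched on a few rows shaped like the Titanic data. The rows are made up, and `FamilySize` is a hypothetical derived feature (it is not a column of the original dataset).

```python
import pandas as pd

# A few rows shaped like the Titanic data (values are illustrative)
passengers = pd.DataFrame({
    "SibSp": [1, 0, 3],
    "Parch": [0, 0, 2],
    "Embarked": ["S", "C", "Q"],
})

# Derived variable: family size = siblings/spouses + parents/children + the passenger
passengers["FamilySize"] = passengers["SibSp"] + passengers["Parch"] + 1

# Dummy (indicator) variables: one 0/1 column per category
embarked_dummies = pd.get_dummies(passengers["Embarked"], prefix="Embarked", dtype=int)
passengers = pd.concat([passengers, embarked_dummies], axis=1)

print(passengers)
```

Each row has exactly one indicator column set to 1, which is where the name one-hot encoding comes from.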

The difference between `scaling` and `normalization` is that, in scaling, you're changing the range of your data while in normalization you're changing the shape of the distribution of your data. Scaling just changes the range of your data. Normalization is a more radical transformation. The point of normalization is to change your observations so that they can be described as a normal distribution.

`Normal distribution`: Also known as the "bell curve", this is a specific statistical distribution where roughly equal numbers of observations fall above and below the mean, the mean and the median are the same, and there are more observations closer to the mean. The normal distribution is also known as the Gaussian distribution.

In general, you'll only want to normalize your data if you're going to be using a machine learning or statistics technique that assumes your data is normally distributed. Some examples of these include t-tests, ANOVAs, linear regression, linear discriminant analysis (LDA) and Gaussian naive Bayes. (Pro tip: any method with "Gaussian" in the name probably assumes normality.)

In [None]:
# Scale and normalize
# modules we'll use
import pandas as pd
import numpy as np

# for Box-Cox Transformation
from scipy import stats

# for min_max scaling
from mlxtend.preprocessing import minmax_scaling

# plotting modules
import seaborn as sns
import matplotlib.pyplot as plt

# read in all our data
kickstarters_2017 = pd.read_csv("../input/kickstarter-projects/ks-projects-201801.csv")

# set seed for reproducibility
np.random.seed(0)

# Scaling
# generate 1000 data points randomly drawn from an exponential distribution
original_data = np.random.exponential(size = 1000)

# min-max scale the data between 0 and 1
scaled_data = minmax_scaling(original_data, columns = [0])

# plot both together to compare
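# (sns.distplot is deprecated; histplot with kde=True is its replacement)
fig, ax=plt.subplots(1,2)
sns.histplot(original_data, kde=True, ax=ax[0])
ax[0].set_title("Original Data")
sns.histplot(np.ravel(scaled_data), kde=True, ax=ax[1])
ax[1].set_title("Scaled data")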

# Notice that the shape of the data doesn't change, but that instead of ranging from 0 to 8ish, it now ranges from 0 to 1.

# Normalization
# The method we're using to normalize here is called the Box-Cox Transformation.
# normalize the exponential data with boxcox
normalized_data = stats.boxcox(original_data)

# plot both together to compare
# (sns.distplot is deprecated; histplot with kde=True is its replacement)
fig, ax=plt.subplots(1,2)
sns.histplot(original_data, kde=True, ax=ax[0])
ax[0].set_title("Original Data")
sns.histplot(normalized_data[0], kde=True, ax=ax[1])
ax[1].set_title("Normalized data")

# Notice that the shape of our data has changed. Before normalizing it was almost L-shaped. 
# But after normalizing it looks more like the outline of a bell (hence "bell curve").



### Character encoding

Character encodings are specific sets of rules for mapping from raw binary byte strings (that look like this: 0110100001101001) to characters that make up human-readable text (like "hi"). There are many different encodings, and if you try to read in text with a different encoding than the one it was originally written in, you end up with scrambled text called "mojibake" (said like mo-gee-bah-kay). Here's an example of mojibake:

æ–‡å—åŒ–ã??

You might also end up with "unknown" characters. These are what gets printed when there's no mapping between a particular byte and a character in the encoding you're using to read your byte string in, and they look like this:

����������

Character encoding mismatches are less common today than they used to be, but it's definitely still a problem. There are lots of different character encodings, but the main one you need to know is UTF-8.

UTF-8 is the standard text encoding. All Python code is in UTF-8 and, ideally, all your data should be as well. It's when things aren't in UTF-8 that you run into trouble.

In [None]:
# modules we'll use
import pandas as pd
import numpy as np

# helpful character encoding module
import chardet

# set seed for reproducibility
np.random.seed(0)

# start with a string
before = "This is the euro symbol: €"

# check to see what datatype it is
type(before)

# encode it to a different encoding, replacing characters that raise errors
after = before.encode("utf-8", errors = "replace")

# check the type
type(after)



If you look at a bytes object, you'll see that it has a b in front of it, and then maybe some text after. That's because bytes are printed out as if they were characters encoded in ASCII. (ASCII is an older character encoding that doesn't really work for writing any language other than English.) Here you can see that our euro symbol has been replaced with some mojibake that looks like "\xe2\x82\xac" when it's printed as if it were an ASCII string.

In [None]:
# take a look at what the bytes look like
after

When we convert our bytes back to a string with the correct encoding, we can see that our text is all there correctly, which is great! :)

In [None]:
# convert it back to utf-8
print(after.decode("utf-8"))

However, when we try to use a different encoding to map our bytes into a string, we get an error. This is because the encoding we're trying to use doesn't know what to do with the bytes we're trying to pass it. You need to tell Python the encoding that the byte string is actually supposed to be in.

You can think of different encodings as different ways of recording music. You can record the same music on a CD, cassette tape or 8-track. While the music may sound more-or-less the same, you need to use the right equipment to play the music from each recording format. The correct decoder is like a cassette player or a cd player. If you try to play a cassette in a CD player, it just won't work.

In [None]:
# try to decode our bytes with the ascii encoding
print(after.decode("ascii"))

We can also run into trouble if we try to use the wrong encoding to map from a string to bytes. Like I said earlier, strings are UTF-8 by default in Python 3, so if we try to treat them like they were in another encoding we'll create problems.

For example, if we try to convert a string to ASCII bytes using encode(), we can ask for the bytes to be what they would be if the text was in ASCII. Since our text isn't in ASCII, though, there will be some characters it can't handle. We can automatically replace the characters that ASCII can't handle. If we do that, however, any characters not in ASCII will just be replaced with the unknown character. Then, when we convert the bytes back to a string, the character will be replaced with the unknown character. The dangerous part about this is that there's no way to tell which character it should have been. That means we may have just made our data unusable!

In [None]:
# start with a string
before = "This is the euro symbol: €"

# encode it to a different encoding, replacing characters that raise errors
after = before.encode("ascii", errors = "replace")

# decode the bytes back to a string
print(after.decode("ascii"))

# We've lost the original underlying byte string! It's been 
# replaced with the underlying byte string for the unknown character :(

This is bad and we want to avoid doing it! It's far better to convert all our text to UTF-8 as soon as we can and keep it in that encoding. The best time to convert non-UTF-8 input into UTF-8 is when you read in files, which we'll talk about next.

First, however, try converting between bytes and strings with different encodings and see what happens. Notice what this does to your text. Would you want this to happen to data you were trying to analyze?
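Here's one possible starting point for that experiment. It's a minimal sketch with illustrative variable names (`lossy`, `failed_char`, `round_trip`): it compares the silent `errors="replace"` behaviour with the default strict behaviour, which raises an error instead of quietly throwing information away.

```python
text = "This is the euro symbol: €"

# replacing unencodable characters succeeds, but the euro sign is gone for good
lossy = text.encode("ascii", errors="replace").decode("ascii")
print(lossy)

# the default (strict) error handling refuses to encode instead of losing data
try:
    text.encode("ascii")
except UnicodeEncodeError as err:
    failed_char = text[err.start:err.end]
    print("ASCII can't represent:", failed_char)

# UTF-8 can represent every character, so the round trip is lossless
round_trip = text.encode("utf-8").decode("utf-8")
print(round_trip == text)
```

Failing loudly is usually the safer choice while you're still figuring out what encoding your data is in.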

### Reading in files with encoding problems

Most files you'll encounter will probably be encoded with UTF-8. This is what Python expects by default, so most of the time you won't run into problems. However, sometimes you'll get an error like this:

In [None]:
# try to read in a file not in UTF-8
kickstarter_2016 = pd.read_csv("../input/kickstarter-projects/ks-projects-201612.csv")

Notice that we get the same UnicodeDecodeError we got when we tried to decode UTF-8 bytes as if they were ASCII! This tells us that this file isn't actually UTF-8. We don't know what encoding it actually is though. One way to figure it out is to try and test a bunch of different character encodings and see if any of them work. A better way, though, is to use the chardet module to try and automatically guess what the right encoding is. It's not 100% guaranteed to be right, but it's usually faster than just trying to guess.

I'm going to just look at the first ten thousand bytes of this file. This is usually enough for a good guess about what the encoding is and is much faster than trying to look at the whole file. (Especially with a large file this can be very slow.) Another reason to just look at the first part of the file is that we can see by looking at the error message that the first problem is the 11th character. So we probably only need to look at the first little bit of the file to figure out what's going on.

In [None]:
# look at the first ten thousand bytes to guess the character encoding
with open("../input/kickstarter-projects/ks-projects-201612.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# check what the character encoding might be
print(result)

So chardet is 73% confident that the right encoding is "Windows-1252". Let's see if that's correct:

In [None]:
# read in the file with the encoding detected by chardet
kickstarter_2016 = pd.read_csv("../input/kickstarter-projects/ks-projects-201612.csv", encoding='Windows-1252')

# look at the first few lines
kickstarter_2016.head()

Yep, looks like chardet was right! The file reads in with no problem (although we do get a warning about datatypes) and when we look at the first few rows it seems to be fine.

What if the encoding chardet guesses isn't right? Since chardet is basically just a fancy guesser, sometimes it will guess the wrong encoding. One thing you can try is passing more or fewer bytes of the file to chardet, seeing whether you get a different result, and then trying that encoding.
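Another fallback is the manual approach mentioned earlier: try a short list of candidate encodings and keep the first one that decodes without an error. The sketch below is only an illustration. The helper `first_working_encoding`, the candidate list and the sample bytes are all made up, and a clean decode doesn't prove the encoding is right (`latin-1` in particular accepts any bytes, which is why it goes last).

```python
def first_working_encoding(raw_bytes, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes raw_bytes without error."""
    for encoding in candidates:
        try:
            raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            # this candidate can't be right, move on to the next one
            continue
        return encoding
    return None

# pretend these bytes came from a file saved as Windows-1252 (cp1252)
sample = "Prix: 5 € café".encode("cp1252")

guess = first_working_encoding(sample)
print(guess)
print(sample.decode(guess))
```

Always look at the decoded text afterwards: if accented letters or currency symbols look wrong, the guess probably is too.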

### Saving your files with UTF-8 encoding

Finally, once you've gone through all the trouble of getting your file into UTF-8, you'll probably want to keep it that way. The easiest way to do that is to save your files with UTF-8 encoding. The good news is, since UTF-8 is the standard encoding in Python, when you save a file it will be saved as UTF-8 by default:

In [None]:
# save our file (will be saved as UTF-8 by default!)
kickstarter_2016.to_csv("ks-projects-201612-utf8.csv")
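If you'd like to convince yourself that nothing gets lost on the way to disk, you can write a small frame and read it straight back. This is a sketch with a made-up DataFrame and a temporary file rather than the Kickstarter data. Passing `encoding="utf-8"` explicitly isn't required, it just makes the intent obvious.

```python
import os
import tempfile

import pandas as pd

# a tiny made-up frame containing some non-ASCII text
df = pd.DataFrame({"name": ["Café Münster", "Prix: 5 €"], "goal": [1000, 2500]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example-utf8.csv")
    # index=False keeps the row index out of the file
    df.to_csv(path, index=False, encoding="utf-8")
    round_trip = pd.read_csv(path, encoding="utf-8")

# the text should come back exactly as it went in
print(round_trip["name"].tolist() == df["name"].tolist())
```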

### Working with dates

In [None]:
# modules we'll use
import pandas as pd
import numpy as np
import seaborn as sns
import datetime

# read in our data
earthquakes = pd.read_csv("../input/earthquake-database/database.csv")
landslides = pd.read_csv("../input/landslide-events/catalog.csv")
volcanos = pd.read_csv("../input/volcanic-eruptions/database.csv")

# set seed for reproducibility
np.random.seed(0)

# print the first few rows of the date column
print(landslides['date'].head())

# Pandas uses the "object" dtype for storing various types of data types, 
# but most often when you see a column with the dtype "object" it will have strings in it.

# check the data type of our date column
landslides['date'].dtype

# create a new column, date_parsed, with the parsed dates
# http://strftime.org/
landslides['date_parsed'] = pd.to_datetime(landslides['date'], format = "%m/%d/%y")

# print the first few rows (the dtype is datetime64)
landslides['date_parsed'].head()



What if I run into an error with multiple date formats? While we're specifying the date format here, sometimes you'll run into an error when there are multiple date formats in a single column. If that happens, you can have pandas try to infer what the right date format should be. You can do that like so:

In [None]:
# note: infer_datetime_format is deprecated in pandas 2.0+, where format inference is the default
landslides['date_parsed'] = pd.to_datetime(landslides['date'], infer_datetime_format=True)

Why don't you always use infer_datetime_format = True? There are two big reasons not to always have pandas guess the time format. The first is that pandas won't always be able to figure out the correct date format, especially if someone has gotten creative with data entry. The second is that it's much slower than specifying the exact format of the dates.
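If you'd rather not have pandas guess at all, one alternative is to parse with an explicit format, let anything that doesn't match become NaT with `errors="coerce"`, and then try a second format on whatever is left over. The sketch below uses a small made-up Series instead of the landslides data, and the two formats are just assumptions about what a messy column might contain.

```python
import pandas as pd

# a made-up column mixing two date formats plus one junk entry
raw_dates = pd.Series(["3/2/07", "3/22/07", "2007-04-06", "not a date"])

# first pass: the main format; rows that don't match become NaT instead of raising
parsed = pd.to_datetime(raw_dates, format="%m/%d/%y", errors="coerce")

# second pass: try another format, but only on the rows that failed
leftovers = raw_dates[parsed.isna()]
second_pass = pd.to_datetime(leftovers, format="%Y-%m-%d", errors="coerce")
parsed = parsed.fillna(second_pass)

# whatever is still NaT needs a human to look at it
still_unparsed = raw_dates[parsed.isna()]
print(parsed)
print(still_unparsed)
```

This keeps each format explicit, so a surprising entry ends up in `still_unparsed` instead of being silently misread.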

In [None]:
# get the day of the month from the date_parsed column
day_of_month_landslides = landslides['date_parsed'].dt.day



One of the biggest dangers in parsing dates is mixing up the months and days. The to_datetime() function does have very helpful error messages, but it doesn't hurt to double-check that the days of the month we've extracted make sense.

To do this, let's plot a histogram of the days of the month. We expect it to have values between 1 and 31 and, since there's no reason to suppose the landslides are more common on some days of the month than others, a relatively even distribution. (With a dip on 31 because not all months have 31 days.) Let's see if that's the case:

In [None]:
# remove na's
day_of_month_landslides = day_of_month_landslides.dropna()

# plot the day of the month
# (histplot replaces distplot, which is deprecated in recent seaborn versions)
sns.histplot(day_of_month_landslides, bins=31)
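A quick numeric check can back up the plot. The sketch below uses a few made-up dates rather than the landslides data: it confirms that every day falls between 1 and 31 and looks at the largest day seen. In a real column with plenty of rows, a maximum of 12 or less would be a strong hint that months and days got swapped.

```python
import pandas as pd

# a few made-up dates in month/day/year order
raw_dates = pd.Series(["03/02/07", "03/22/07", "12/31/15", "01/15/16"])
parsed = pd.to_datetime(raw_dates, format="%m/%d/%y")

# pull out the day of the month
days = parsed.dt.day

# every value should be a valid day of the month
all_valid = bool(days.between(1, 31).all())

# if days never go above 12 in a large column, suspect a month/day mix-up
largest_day = int(days.max())

print(all_valid, largest_day)
print(days.value_counts().sort_index())
```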

### Inconsistent data entry

