## **Introduction**
This Kernel covers the basic pandas commands that every ML engineer or data scientist should know, so it is aimed at beginners. We will use these commands to transform and clean the Rossmann Store Sales dataset. Some of the transformations in this Kernel don't make much sense from a machine learning perspective; I include them only to illustrate pandas. I follow the steps described in the YouTube tutorial "Introduction To Data Analytics With Pandas" by Quentin Caudron, but apply them to the Rossmann dataset.

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------


In [None]:
import pandas as pd
import numpy as np
import matplotlib

In [None]:
train_df = pd.read_csv("../input/train.csv")

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------


# **Data Exploration**
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------


In [None]:
# .head() returns the first 5 rows of a dataset
train_df.head()

In [None]:
# .tail() returns the last 5 rows of a dataset
train_df.tail()

In [None]:
# .info() shows general information about the dataframe, such as the number of entries,
# the number of features, and the feature dtypes.
train_df.info()

In [None]:
# .iloc[] selects rows by integer position. Pass the position of the row you want to see.
# There is also .loc[], which selects by label and also accepts a boolean array.
train_df.iloc[2]
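
To make the difference between `.iloc[]` and `.loc[]` concrete, here is a minimal sketch on a made-up dataframe. The name `toy_df`, its values, and its letter index are assumptions for illustration only and are not part of the Rossmann data.

```python
import pandas as pd

# Hypothetical toy data; the column names only mimic the Rossmann layout.
toy_df = pd.DataFrame(
    {"Store": [1, 2, 3, 4], "Sales": [5263, 6064, 8314, 0], "Open": [1, 1, 1, 0]},
    index=["w", "x", "y", "z"],
)

# .iloc selects by integer position: the third row, whatever its label is.
third_row = toy_df.iloc[2]

# .loc selects by index label: the row labelled "y" (the same row here).
row_y = toy_df.loc["y"]

# .loc also accepts a boolean array: keep only the rows where the store was open.
open_rows = toy_df.loc[toy_df["Open"] == 1]

print(third_row)
print(open_rows)
```

On `train_df` the default index is 0, 1, 2, ..., so position and label happen to coincide until the index is changed.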

In [None]:
# .describe() shows summary statistics (count, mean, std, min, quartiles, max) for the numeric columns.
train_df.describe()
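
By default `.describe()` only summarises numeric columns. The following sketch, on a hypothetical `toy_df` with one numeric and one text column (both invented for this example), shows how `include="all"` brings the non-numeric columns into the summary as well.

```python
import pandas as pd

# Hypothetical toy data with one numeric and one text column.
toy_df = pd.DataFrame(
    {"Sales": [100, 200, 300, 400], "StoreType": ["a", "b", "a", "c"]}
)

# Default behaviour: only the numeric column "Sales" is summarised.
numeric_summary = toy_df.describe()

# include="all" adds rows such as "unique", "top" and "freq" for text columns.
full_summary = toy_df.describe(include="all")

print(numeric_summary)
print(full_summary)
```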

## **Missing Data?**

Let's check whether we have any missing data in our dataset.

In [None]:
# .isnull() flags missing values.
# .sum() adds up the flags, giving the number of missing values per column.
# .max() returns the largest of these per-column counts.
# Chaining the 3 methods gives the highest number of missing values found in any
# single column; a result of 0 means that no column has missing values.
train_df.isnull().sum().max() 
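
The chained call returns a single number, so it can help to look at the intermediate results too. Below is a small sketch on a hypothetical `toy_df` (invented values, with a few `NaN` entries added on purpose) that separates the per-column counts, the largest per-column count, and the total.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some deliberately missing entries.
toy_df = pd.DataFrame(
    {
        "Store": [1, 2, 3, 4],
        "Sales": [5263.0, np.nan, 8314.0, np.nan],
        "Customers": [555.0, 625.0, np.nan, 498.0],
    }
)

# Number of missing values in each column.
missing_per_column = toy_df.isnull().sum()

# Largest per-column count (what .isnull().sum().max() returns).
worst_column_count = missing_per_column.max()

# Total number of missing values in the whole dataframe.
total_missing = missing_per_column.sum()

# Quick yes/no check: is anything missing at all?
has_missing = toy_df.isnull().values.any()

print(missing_per_column)
print(worst_column_count, total_missing, has_missing)
```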

## **Changing pandas dtypes**

In [None]:
# .dtypes shows all features of the dataset and the type pandas stored each of them as
train_df.dtypes

In [None]:
print(train_df.Date[0])
# type() returns the type of the input
print(type(train_df.Date[0]))

The Date feature is stored as a string. We will convert it into a pandas Datetime object, so that it is easier to work with.

In [None]:
# pd.to_datetime() converts the column into datetime objects
train_df.Date = pd.to_datetime(train_df.Date)
# confirm the types
train_df.dtypes
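
Once a column holds datetimes, its parts are available through the `.dt` accessor. The sketch below uses a hypothetical `toy_df` with three invented date strings and assumes they are written as year-month-day; passing `format` explicitly is optional, but it makes that assumption visible.

```python
import pandas as pd

# Hypothetical toy data: dates stored as plain strings.
toy_df = pd.DataFrame({"Date": ["2015-07-31", "2015-07-30", "2015-07-29"]})

# Assumption: the strings follow the year-month-day layout given in `format`.
toy_df["Date"] = pd.to_datetime(toy_df["Date"], format="%Y-%m-%d")

# The .dt accessor exposes the individual parts of each date.
years = toy_df["Date"].dt.year
months = toy_df["Date"].dt.month
days = toy_df["Date"].dt.day

print(toy_df.dtypes)
print(years.tolist(), months.tolist(), days.tolist())
```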

## **Converting Values**

In [None]:
# train_df.StateHoliday selects the StateHoliday feature
# value_counts() returns the distinct values of a feature and how often each one occurs.
train_df.StateHoliday.value_counts()

As you can see, the StateHoliday feature contains letters as well as numbers. We will therefore convert the letters into numeric values.

In [None]:
# create a mapping dictionary
mapping_dictionary = {"StateHoliday": {"a": 1, "b": 2, "c": 3}}

In [None]:
# .replace() replaces the values. 
train_df.replace(mapping_dictionary, inplace = True)

In [None]:
# Let's see if it worked:
train_df.StateHoliday.value_counts()

In [None]:
# Let's check the dtype again.
train_df.dtypes

We successfully converted the feature's values, but pandas still stores the StateHoliday feature as an object dtype. We will convert it to int64.

In [None]:
# astype() converts the dtype, in this example to an integer (int64)
train_df.StateHoliday = train_df.StateHoliday.astype(int)
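
As an alternative to `.replace()` followed by `.astype()`, the same kind of encoding can be sketched with `Series.map()`. This is a different technique from the one used in this Kernel, shown on a hypothetical series `toy_holiday` with invented values. One thing to keep in mind: `.map()` turns every value that is missing from the dictionary into `NaN`, so the dictionary has to cover all values that occur in the column.

```python
import pandas as pd

# Hypothetical toy column that mixes the string "0" with letter codes.
toy_holiday = pd.Series(["0", "a", "b", "c", "0"], name="StateHoliday")

# The mapping must list every value in the column, including "0".
holiday_codes = {"0": 0, "a": 1, "b": 2, "c": 3}

# Map each value to its numeric code.
encoded = toy_holiday.map(holiday_codes)

# Values missing from the dictionary would show up as NaN here.
unmapped = int(encoded.isnull().sum())

# Make the integer dtype explicit once we know that nothing is missing.
encoded = encoded.astype("int64")

print(encoded)
print("unmapped values:", unmapped)
```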

In [None]:
train_df.dtypes

In [None]:
train_df.head()

## **Creating new features and using the Datetime object**

The DayOfWeek feature stores the day as a number, which is convenient as input for an algorithm, but for illustration purposes we will create a new feature that contains the actual weekday as a string. We will delete the DayOfWeek feature because it doesn't work well together with the Datetime object, which is much easier to work with.

In [None]:
# .drop() to drop the DayOfWeek feature
train_df = train_df.drop("DayOfWeek", axis=1)

In [None]:
# Create a series with the weekday of each entry using dt.weekday (Monday = 0, ..., Sunday = 6).
# Pandas can derive the day for each date because we previously
# converted the Date feature into a Datetime object.
weekdays = train_df.Date.dt.weekday
# assign() assigns the new weekdays feature to our dataframe.
train_df = train_df.assign(weekdays = weekdays)

In [None]:
train_df.head()

We now have a new feature called weekdays that stores the day of the week. Next we will turn the numbers it contains into actual weekday names.

In [None]:
# create a list of the day names
weekday_names = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
# dict comprehension that maps each weekday number (0-6) to its name
weekday_dict = {key: weekday_names[key] for key in range(7)}

# function that replaces a weekday number with its name
def day_of_week(idx):
    return weekday_dict[idx]
# use apply() to apply our function to the weekdays column
train_df.weekdays = train_df.weekdays.apply(day_of_week)
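
The dictionary and `apply()` approach above is good practice for writing your own functions. For comparison, here is a short sketch of a built-in shortcut, `dt.day_name()`, which returns the weekday names directly. The dataframe `toy_df`, its dates, and the column name `weekday_number` are made up for this example.

```python
import pandas as pd

# Hypothetical toy data: three dates that are already datetime objects.
toy_df = pd.DataFrame(
    {"Date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-07"])}
)

# dt.weekday gives the number (Monday = 0), dt.day_name() gives the English name.
toy_df = toy_df.assign(
    weekday_number=toy_df["Date"].dt.weekday,
    weekdays=toy_df["Date"].dt.day_name(),
)

print(toy_df)
```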

In [None]:
train_df.weekdays.value_counts()

## **Grouping data by a Feature**
Let's group these weekdays.

In [None]:
# groupby() groups the rows by weekday and count() counts the non-null values of each column per group
weekday_counts = train_df.groupby("weekdays").count()

# We can reorder this dataframe by our weekday_names list
weekday_counts = weekday_counts.loc[weekday_names]

weekday_counts
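
Two variations on this grouping, sketched on a hypothetical `toy_df` with invented values: `size()` returns one row count per group instead of one count per column, and `reindex()` orders the result by a list of names while filling in days that never occur (selecting with `.loc[]` and a list raises a `KeyError` if a label is missing). The last line shows that other aggregations, such as `mean()`, follow the same pattern.

```python
import pandas as pd

weekday_names = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical toy data in which several weekdays do not occur at all.
toy_df = pd.DataFrame(
    {
        "weekdays": ["Tuesday", "Monday", "Tuesday", "Sunday", "Monday", "Tuesday"],
        "Sales": [100, 200, 150, 0, 250, 50],
    }
)

# size() counts the rows in each group and returns a single Series.
rows_per_day = toy_df.groupby("weekdays").size()

# reindex() puts the days in calendar order and fills absent days with 0.
rows_per_day = rows_per_day.reindex(weekday_names, fill_value=0)

# Any other aggregation works the same way, for example the mean sales per day.
mean_sales = toy_df.groupby("weekdays")["Sales"].mean()

print(rows_per_day)
print(mean_sales)
```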

In [None]:
train_df.head()

## **Changing the index**

In [None]:
# Assign to .index to make the Date feature the index
train_df.index = train_df.Date
# Let's drop the "old" Date feature because we no longer need it: its values are
# now the index.
# .drop() to drop the feature
train_df.drop(["Date"], axis = 1, inplace = True)

In [None]:
train_df.head()

Instead of 0, 1, 2, 3, 4... we now have the actual dates as the index on the left of the dataframe.
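
The two steps (assigning the index, then dropping the column) can also be written as one call to `set_index()`. The sketch below does this on a hypothetical `toy_df` with invented dates and sales, and then shows one benefit of a date index: selecting a whole month with a partial date string.

```python
import pandas as pd

# Hypothetical toy data with a datetime column.
toy_df = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2015-07-31", "2015-07-30", "2015-06-15"]),
        "Sales": [5263, 5020, 4100],
    }
)

# set_index() moves Date into the index and removes it from the columns.
# Sorting the index keeps date-based selection predictable.
indexed_df = toy_df.set_index("Date").sort_index()

# With a DatetimeIndex, .loc accepts a partial date string such as a month.
july_rows = indexed_df.loc["2015-07"]

print(indexed_df)
print(july_rows)
```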

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
