#### Import libraries


In [None]:
import pandas as pd
import numpy as np

#### Load the it_hardware_sales_week.csv file

In [None]:
df = pd.read_csv('../data/inputs/raw/it_hardware_sales_week.csv')
df.head()

Instead of always reading the top few rows of the dataset, we can look at a random sample. Setting ```random_state``` makes the sample reproducible

In [None]:
df.sample(n=5, random_state=111)

#### Let's look at the basic details of the data set

We can check the dimensions of the data set with ```shape```

In [None]:
df.shape

So this data set has 100 rows and 5 columns

We'll use ```info()``` to list each feature/column with its data type and number of non-null values

In [None]:
df.info()

We can use ```describe()``` with the ```include='all'``` parameter to see the summary statistics for numeric columns, and the count, number of unique values, and most frequent value for categorical columns

In [None]:
df.describe(include='all')

We can count the null values in each column like this:

In [None]:
df.isna().sum()
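
As a minimal sketch of what that count means, here is the same call on a small made-up frame. The values in ```toy``` are hypothetical and are not taken from the sales file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real sales file
toy = pd.DataFrame({
    'SalesAgent': ['Alice', 'Bob', None],
    'Amount': [120.0, np.nan, 80.0],
})

# isna() marks each missing cell as True; sum() then counts the True values per column
null_counts = toy.isna().sum()
print(null_counts)
```

A column with a count of 0 has no missing values.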

#### Renaming columns

In [None]:
df_renamed = df.rename(columns={
    'Date': 'date',
    'SalesAgent': 'sales_agent',
    'Amount': 'amount',
    'Category': 'category',
    'Payment Type': 'payment_type'
})
df_renamed

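Renaming columns, but with a dictionary comprehension to make it reusable. Note that this version only replaces spaces and lowercases the name, so 'SalesAgent' becomes 'salesagent' rather than 'sales_agent'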

In [None]:
df_renamed = df.rename(columns={
    col: col.replace(' ', '_').lower() for col in df.columns
})
df_renamed
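
If we also want CamelCase names such as 'SalesAgent' split into words, one possible approach is a small helper passed to ```rename()```. The helper name ```to_snake_case``` and the regular expression are assumptions for illustration, and the empty frame below only borrows three of the column names.

```python
import re

import pandas as pd

def to_snake_case(name):
    # Insert an underscore where a lowercase letter or digit is followed by an uppercase letter
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    # Then replace spaces and lowercase everything
    return name.replace(' ', '_').lower()

# Empty frame: only the column names matter here
toy = pd.DataFrame(columns=['Date', 'SalesAgent', 'Payment Type'])

# rename() accepts a function as well as a dictionary
renamed = toy.rename(columns=to_snake_case)
print(list(renamed.columns))
```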

#### Looking things up

How can we look at individual rows/columns in the dataset?

In [None]:
df[['SalesAgent', 'Date']].head(3)

What if we want to look at a particular row?

In [None]:
df.loc[34]

Or a particular row and column

In [None]:
df.loc[34, 'SalesAgent']

We can also use ```iloc``` to select rows and columns by integer position, with the same slicing syntax as Python lists (the end of the slice is excluded)

In [None]:
df.iloc[10:20, 2:4]
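
The difference between ```loc``` and ```iloc``` is easiest to see when the index labels are not 0, 1, 2, and so on. This is a small sketch on made-up data; the index values 10 to 40 are chosen only to make labels and positions differ.

```python
import pandas as pd

# Hypothetical toy data with a non-default index
toy = pd.DataFrame(
    {'SalesAgent': ['Alice', 'Bob', 'Chris', 'Dana'],
     'Amount': [100, 250, 75, 400]},
    index=[10, 20, 30, 40],
)

by_label = toy.loc[20, 'SalesAgent']   # row with index label 20
by_position = toy.iloc[1, 0]           # second row, first column

label_slice = toy.loc[10:30]      # label slices include the end label
position_slice = toy.iloc[0:2]    # position slices exclude the end position

print(by_label, by_position, len(label_slice), len(position_slice))
```

Both lookups return the same cell, but the two slices return a different number of rows.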

We can select rows based on a condition

In [None]:
df[df['SalesAgent'] == 'Bob']

In [None]:
df[df['Amount'] > 1000]

We can get the same result using ```query()```

In [None]:
df.query('Amount > 1000')
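
Inside a ```query()``` string, a Python variable can be referenced by prefixing it with ```@```. The sketch below uses made-up data and a hypothetical ```threshold``` variable to show that the mask and the query select the same rows.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'SalesAgent': ['Alice', 'Bob', 'Chris'],
    'Amount': [1200, 800, 1500],
})

threshold = 1000

via_mask = toy[toy['Amount'] > threshold]
via_query = toy.query('Amount > @threshold')   # @ refers to the Python variable

print(via_mask.equals(via_query))
```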

We can also use ```isin()```

In [None]:
df_filtered = df[
    df['Category'].isin(['Desktop', 'Laptop']) &
    ~df['SalesAgent'].isin(['Chris', 'Bob'])
]
df_filtered
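
To see how the combined filter behaves, here is a sketch on made-up rows. Naming each mask first can make the logic easier to read. When a condition uses a comparison operator, it needs its own parentheses, because ```&``` binds more tightly than ```>``` or ```==```.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Category': ['Laptop', 'Desktop', 'Monitor', 'Laptop'],
    'SalesAgent': ['Alice', 'Bob', 'Alice', 'Chris'],
    'Amount': [1200, 800, 300, 1500],
})

in_category = toy['Category'].isin(['Desktop', 'Laptop'])
excluded_agent = toy['SalesAgent'].isin(['Chris', 'Bob'])

# ~ negates a boolean mask, & combines two masks row by row
filtered = toy[in_category & ~excluded_agent]

# Comparisons must be wrapped in parentheses before combining them with &
big_laptops = toy[(toy['Amount'] > 1000) & (toy['Category'] == 'Laptop')]

print(filtered)
print(big_laptops)
```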

#### Grouping data

We can group data using the ```groupby()``` method

In [None]:
grouped = df.groupby(by=['Category', 'SalesAgent'])['Amount'].agg(['sum', 'mean', 'count', 'max'])
grouped

Show the data for all agents in one category

In [None]:
grouped.loc['Laptop']

We can provide a custom aggregation function too

In [None]:
def get_range(x):
    return x.max() - x.min()

grouped = df.groupby(['Category', 'SalesAgent'])['Amount'].agg(
    total_sales='sum',
    average_sale='mean',
    range=get_range
)
grouped
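
In this form of ```agg()```, each keyword becomes an output column name and each value is the aggregation to apply. A small sketch on made-up amounts shows the result; the column name ```sales_range``` is our own choice.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Category': ['Laptop', 'Laptop', 'Desktop'],
    'Amount': [1000, 1500, 700],
})

def get_range(x):
    # x is the Series of Amount values for one group
    return x.max() - x.min()

summary = toy.groupby('Category')['Amount'].agg(
    total_sales='sum',        # built-in aggregation, given by name
    sales_range=get_range,    # custom function, called once per group
)
print(summary)
```

A group with a single row, like Desktop here, has a range of 0.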


#### Adding and changing columns

We can create a new column - let's create a Day of Week column

First we need to convert the Date column into the correct type

In [None]:
df['Date'].dtype

In [None]:
df['Datetime'] = pd.to_datetime(df['Date'], errors='raise')
df
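
The ```errors``` parameter controls what happens when a value cannot be parsed. The sketch below uses a made-up Series with one deliberately bad value to contrast ```errors='raise'``` with ```errors='coerce'```.

```python
import pandas as pd

# Hypothetical input with one value that is not a date
raw = pd.Series(['2024-01-15', 'not a date'])

# errors='coerce' replaces unparseable values with NaT instead of raising
parsed = pd.to_datetime(raw, errors='coerce')

# errors='raise' stops with an exception on the bad value
try:
    pd.to_datetime(raw, errors='raise')
    raised = False
except ValueError:
    raised = True

print(parsed)
print(raised)
```

Using ```errors='raise'``` is the safer choice when bad dates should never go unnoticed; ```errors='coerce'``` is useful when we would rather inspect the resulting NaT values afterwards.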

In [None]:
df['DayOfWeek'] = df['Datetime'].dt.day_name()
df

#### Applying a function to a column

In [None]:
def is_weekday(date):
    return date.day_name() not in ['Saturday', 'Sunday']

df['IsWeekday'] = df['Datetime'].apply(is_weekday)
df
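
For a simple check like this, the ```.dt``` accessor can often do the same work without ```apply()```. The sketch below compares the two on four made-up dates (a Friday through a Monday); ```dayofweek``` numbers Monday as 0 and Sunday as 6.

```python
import pandas as pd

# Hypothetical dates: Friday, Saturday, Sunday, Monday
dates = pd.Series(pd.to_datetime(
    ['2024-01-12', '2024-01-13', '2024-01-14', '2024-01-15']
))

def is_weekday(date):
    return date.day_name() not in ['Saturday', 'Sunday']

via_apply = dates.apply(is_weekday)    # calls the function once per value
via_dt = dates.dt.dayofweek < 5        # one vectorised comparison over the column

print(via_apply.tolist())
print(via_dt.tolist())
```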

Each value in a pandas datetime column is an instance of ```Timestamp```, which is why ```is_weekday``` can call ```day_name()``` on it. You can find the docs here:
[Timestamp documentation](https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.html)

#### Long vs wide

We can use the ```pivot_table``` function to convert our dataframe into a wide format. The aggregation method is 'mean' by default; here we pass ```aggfunc='sum'``` to get total sales instead.

In [None]:
df.head()

In [None]:
pivot = pd.pivot_table(
    data=df,
    index='SalesAgent',
    columns='Category',
    values='Amount',
    aggfunc='sum'
)

pivot
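
To make the effect of ```aggfunc``` concrete, here is a sketch on three made-up rows. It also uses ```fill_value=0```, an optional parameter, to replace the NaN that appears for agent/category pairs with no sales.

```python
import pandas as pd

# Hypothetical toy data: Alice has two Laptop sales, Bob has one Desktop sale
toy = pd.DataFrame({
    'SalesAgent': ['Alice', 'Alice', 'Bob'],
    'Category': ['Laptop', 'Laptop', 'Desktop'],
    'Amount': [100, 300, 50],
})

# Default aggregation is the mean of each agent/category group
mean_pivot = pd.pivot_table(
    toy, index='SalesAgent', columns='Category', values='Amount'
)

# Sum instead, and show 0 rather than NaN for missing combinations
sum_pivot = pd.pivot_table(
    toy, index='SalesAgent', columns='Category', values='Amount',
    aggfunc='sum', fill_value=0
)

print(mean_pivot)
print(sum_pivot)
```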

And we can convert it back to a long format again

In [None]:
pivot_reset = pivot.reset_index()

melted = pd.melt(
    pivot_reset,
    id_vars='SalesAgent',
    var_name='Category',
    value_name='Amount'
)

melted
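
Melting a pivot does not recover the original rows: it returns one row per agent/category pair holding the aggregated value. A minimal sketch with a made-up wide table:

```python
import pandas as pd

# Hypothetical wide table: one row per agent, one column per category
wide = pd.DataFrame({
    'SalesAgent': ['Alice', 'Bob'],
    'Desktop': [0, 50],
    'Laptop': [400, 0],
})

# Every non-id column becomes a (Category, Amount) pair for each agent
long = pd.melt(
    wide,
    id_vars='SalesAgent',
    var_name='Category',
    value_name='Amount'
)
print(long)
```

Two agents and two categories give four rows in the long format.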