# TRANSACTIONS TO PARTIAL RESULTS:

This notebook runs an exploratory data analysis on the partially treated data in order to draw our first conclusions.

# 1. IMPORTING PACKAGES AND THE INFORMATION:


In [None]:
# Importing packages:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

In [None]:
#Defining the search path of the file, the name and the separator:

file_path = "../../data/01_raw/"
file_name = "b2-transactions.csv"

sep=";"


# Provisional file:

total_sales_results_per_id="total_sales_results_per_id.csv"
total_sales_results_per_id_and_store="total_sales_results_per_id_and_store.csv"

In [None]:
# Now we import the file and store it in df:
# (at first, we only import the first 1,000,000 rows)

df=pd.read_csv(file_path+file_name, nrows=1000000, sep=sep)
df.head()
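As a small illustration of what `sep` and `nrows` do in the call above, the sketch below reads a semicolon-separated sample from memory. The sample rows are invented for the example; only the column names are taken from this notebook.

```python
import io
import pandas as pd

# Hypothetical in-memory sample standing in for the real transactions file
sample_csv = io.StringIO(
    "product_id;description;units_ordered;store\n"
    "245;MILK 1L;3,00;1\n"
    "245;MILK 1 L;2,00;2\n"
    "9999;CUSTOM ORDER;0,00;1\n"
    "301;BREAD;1,00;2\n"
)

# sep=";" splits fields on semicolons; nrows=3 parses only the first 3 data rows
sample_df = pd.read_csv(sample_csv, sep=";", nrows=3)
print(sample_df.shape)
```

Because only the first rows are parsed, anything computed afterwards describes that sample and not the whole file.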

In [None]:
df.shape

In [None]:
df.columns

# 2. CHECKING FOR NULL VALUES:

In [None]:
# Checking whether there are any null values:

df.isnull().values.any()

In [None]:
# Also checking the NA values (in pandas, isna() is an alias of isnull(), so the result is the same):

df.isna().values.any()

In [None]:
# We can build a boolean vector to select the rows that have any missing data:

# First we create a boolean array that tells us whether each row has missing data:

missing_data_check=False
for column in df.columns:
    missing_data_check = missing_data_check | df[column].isnull()
    
# We can now filter the array to keep only the rows that have missing data:

array_of_missing_values=missing_data_check[missing_data_check]

In [None]:
# We now check the length of this resulting vector:

len(array_of_missing_values)
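The column-by-column loop above can also be written as a single vectorised call. The sketch below, on an invented toy frame, compares both versions and computes the share of affected rows; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame with two rows containing a missing value (invented data)
toy = pd.DataFrame({
    "product_id": [1, 2, 3, 4],
    "description": ["a", None, "c", "d"],
    "units_ordered": ["1,00", "2,00", np.nan, "4,00"],
})

# Loop version: OR together the per-column null masks
mask_loop = False
for column in toy.columns:
    mask_loop = mask_loop | toy[column].isnull()

# Vectorised version: True for every row with at least one null value
mask_any = toy.isnull().any(axis=1)

n_missing = int(mask_any.sum())
share_missing = n_missing / len(toy)
print(n_missing, share_missing)
```

Both masks mark the same rows, so either one can be used to decide whether dropping them is acceptable.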

In [None]:
# After checking that the share of rows with missing data is negligible (1056 out of 1000000 rows, about 0.1%), we decide to drop them:

df.dropna(how='any', inplace=True)

In [None]:
# We drop the 'Unnamed: 0' column, since it seems to be a leftover index column and its information is redundant:

df=df.drop('Unnamed: 0', axis=1)
df.head(10)

# 3. FIRST EVALUATIONS OF THE DATA:

## 3.1. Checking the number of different ids and plotting their count:

In [None]:
# Taking a look at how many different ids exist and their totals (in the rows we have loaded, of course):

prod_and_num_trans=df.groupby('product_id').count()['description'].sort_values(ascending=False)
prod_and_num_trans

In [None]:
# We want to check how many different ids exist in our df:

len(prod_and_num_trans)

# The result is that in the first 100.000 rows we have 1283 unique ids.

In [None]:
best_sellers_list=prod_and_num_trans.iloc[0:50]
best_sellers_list

In [None]:
names=[str(x) for x in best_sellers_list.index]
plt.bar(names, height=best_sellers_list.values);

## 3.2. Adding all the orders for each id and getting first totals and sales share:

## 3.2.1. First problem:

We face our first problem here. The information on units ordered is not stored as an integer, as we would like, but as a string. We have to convert it appropriately before going on:

In [None]:
# .iloc selects by position, so this still works if the row labelled 5 was dropped above:
type(df['units_ordered'].iloc[5])

In [None]:
# Quick check on the different positions the comma might be at:

comma_positions=df['units_ordered'].str.find(",")
comma_positions.unique()

In [None]:
# The 'units_ordered' column is a string that cannot be directly converted to integer, because
# its numbers are in continental format ("," instead of "." as the decimal separator).

# So, we split the string at the comma, take the first part and store it in the df as a 64-bit
# integer ('int64'), so that large values cannot overflow a smaller integer type:

df['units_ordered_numeric']=df['units_ordered'].str.split(",").str[0].astype('int64')

df.columns
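The conversion above keeps only the integer part of each value. The sketch below, on invented strings, shows that approach next to an alternative that keeps the decimals. It assumes the values use a decimal comma and no thousands separators.

```python
import pandas as pd

# Invented values in continental format (decimal comma)
units = pd.Series(["3,00", "12,50", "0,00", "1050,75"])

# Approach used in this notebook: keep the text before the comma (decimals are truncated)
units_int = units.str.split(",").str[0].astype("int64")

# Alternative: replace the decimal comma with a point and parse as float
units_float = units.str.replace(",", ".", regex=False).astype(float)

print(units_int.tolist())
print(units_float.tolist())
```

`pd.read_csv` also accepts a `decimal=","` argument, which would parse such a column as numbers at load time, provided every value in the column is well formed.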

## 3.2.2. Getting the totals:

In [None]:
# We take a quick look at the results to check that everything is fine:

df.head(20)

In [None]:
# Proceeding to check the products:

# We want to group by id and description now. What we want to check is:

    # Whether there are ids assigned to several products or not
    # If an id is assigned to several products, whether there is a logical relationship among those products
    # The quantities of the products bought in the rows we have selected

# numeric_only=True keeps only the numeric columns in the sum (string columns such as 'units_ordered' are left out):
totals_by_id_description=df.groupby(['product_id', 'description'], as_index=False).sum(numeric_only=True).sort_values('units_ordered_numeric', ascending=False)
totals_by_id_description.head()

In [None]:
total_sales=totals_by_id_description['units_ordered_numeric'].sum()
total_sales

## 3.2.3. Getting total shares:

In [None]:
totals_by_id_description['sells_share']=totals_by_id_description['units_ordered_numeric']/total_sales

totals_by_id_description.head()
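A quick way to sanity-check the share column is to confirm that the shares add up to 1. The sketch below does this on invented totals; the column names match the ones used in this notebook.

```python
import pandas as pd

# Invented totals for three ids
toy_totals = pd.DataFrame({
    "product_id": [245, 301, 9999],
    "units_ordered_numeric": [60, 30, 10],
})

# Share of each id = its units divided by the grand total
total = toy_totals["units_ordered_numeric"].sum()
toy_totals["sells_share"] = toy_totals["units_ordered_numeric"] / total

print(toy_totals)
print(toy_totals["sells_share"].sum())
```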

## 3.2.4. Second problem:

We have a slight problem with the data: the id-description relationship is not unique, as we can see below:


In [None]:
totals_by_id_description[totals_by_id_description['product_id']==245].head(20)

We then proceed to count and sort:

In [None]:
# An accessory table is created to store the number of descriptions for each id; this table is then
# joined to our main df: totals_by_id_description

accesory_table_1=totals_by_id_description.groupby('product_id')['description'].count().rename('count')

totals_by_id_description.merge(accesory_table_1, on='product_id').sort_values('count', ascending=False).head(20)
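Another way to spot ids with more than one description is to count distinct descriptions per id directly. This is a minimal sketch on invented data, not the method used in the cell above.

```python
import pandas as pd

# Invented rows: id 245 has two spellings, id 9999 has two unrelated products
toy = pd.DataFrame({
    "product_id": [245, 245, 245, 301, 9999, 9999],
    "description": ["MILK 1L", "MILK 1 L", "MILK 1L", "BREAD", "GIFT", "CUSTOM"],
})

# Number of distinct descriptions attached to each id
names_per_id = (
    toy.groupby("product_id")["description"]
    .nunique()
    .sort_values(ascending=False)
)

# Ids whose description is not unique
ambiguous_ids = names_per_id[names_per_id > 1].index.tolist()
print(names_per_id)
print(sorted(ambiguous_ids))
```

`nunique` counts each description once per id, so repeated rows with the same text do not inflate the result.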

We can see that there is a product id used just for orders (9999), which cannot be identified with a single product.

On the other hand, this product id has certain peculiarities: for instance, there are lots of orders with 0 units ordered, which seems strange.

In [None]:
filter1=(totals_by_id_description['product_id']==9999) &  (totals_by_id_description['units_ordered_numeric']==0)
totals_by_id_description[filter1].head(10)

So, let's take a look at the relationship between the id and its description:

In [None]:
dif_id_description_matches=totals_by_id_description[['product_id','description']].sort_values('product_id', ascending=True)
dif_id_description_matches.head()

In [None]:
prods_per_id=dif_id_description_matches.groupby('product_id', as_index=False).count().sort_values('description', ascending=False)

In [None]:
prods_per_id.head()

In [None]:
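# Rows 1 to 49 by position: the first row (the id with the most descriptions) is left out of the plot
ppi50=prods_per_id[1:50]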

names2=[str(x) for x in ppi50['product_id']]

plt.bar(names2, ppi50['description']);

In [None]:
#Keeping these lines just in case:

# searching_for_unique=totals_by_id_description[['product_id','description']]
# searching_for_unique['joined_cols']=searching_for_unique['product_id'].apply(str)+"/"+searching_for_unique['description']

These results make us rethink the groupby we used.

Is it perhaps more useful to group the products just by id?

In [None]:
accesory_table2=df.groupby(['product_id'], as_index=False).first()[['product_id','description']]
df.head()

In [None]:
totals_by_id=df.groupby(['product_id'], as_index=False).agg(total=('units_ordered_numeric', 'sum'), description=('description', 'first'), num_rows=('description', 'count')).sort_values('total', ascending=False)

In [None]:
# dict2  # leftover: 'dict2' is never defined in this notebook, so evaluating it would raise a NameError

In [None]:
# Note: 'units_ordered':'count' counts the rows of each id (not the distinct descriptions):
dict1={'units_ordered_numeric':'sum','description':'first','units_ordered':'count'}

totals_by_id=df.groupby(['product_id'], as_index=False).agg(dict1).sort_values('units_ordered_numeric', ascending=False)

list1=['product_id', 'total_orders', 'description', 'number_of_different_names']

totals_by_id.columns=list1

totals_by_id.head(20)
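The difference between counting rows and counting distinct descriptions per id can be seen with named aggregation. The sketch below uses invented rows; `num_rows` is a hypothetical column name added for the comparison.

```python
import pandas as pd

# Invented rows: id 245 appears three times under two different descriptions
toy = pd.DataFrame({
    "product_id": [245, 245, 245, 301],
    "description": ["MILK 1L", "MILK 1 L", "MILK 1L", "BREAD"],
    "units_ordered_numeric": [3, 2, 5, 1],
})

summary = (
    toy.groupby("product_id", as_index=False)
    .agg(
        total_orders=("units_ordered_numeric", "sum"),          # units summed per id
        description=("description", "first"),                   # first description seen
        num_rows=("description", "count"),                      # rows per id
        number_of_different_names=("description", "nunique"),   # distinct descriptions per id
    )
    .sort_values("total_orders", ascending=False)
)
print(summary)
```

For id 245 the row count and the number of distinct descriptions differ, which is the case where the choice of aggregation matters.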

In [None]:
dict1={'units_ordered_numeric':'sum','description':'first','units_ordered':'count'}

totals_by_id_and_store=df.groupby(['product_id', 'store'], as_index=False).agg(dict1).sort_values(['store','units_ordered_numeric'], ascending=False)

list2=['product_id', 'store', 'total_orders', 'description', 'number_of_different_names']

totals_by_id_and_store.columns=list2

totals_by_id_and_store.head(20)

# 4. ENDING:

After all this process, we have learned a few things:

- We have data related to orders of several products and stores. The products are identified by an id and a description.

- Our data has null values, but very few, so we have decided to discard them.

- Our data does not have a solid relationship between id and description of the products. In general, an id is assigned to many "similar" products. In a few cases, an id is given to two dissimilar products (we are assuming that this is due to human error).

- Also, there is an id (9999) that, as it is used for direct orders from customers, is assigned to a lot of different products. We could try to reassign these products by their description to another suitable id, or disregard the whole id. As we were asked not to take into account the direct online orders from customers, the solution should be not to use this id.

- Apart from the above, id seems a better indicator than description for grouping the aforementioned products.

- Some additional checks should be done to address the suitability of the id as an indicator of the product. Specifically, it would be interesting to check the behaviour of id against a manual filter based on some keywords.
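Two of the follow-ups listed above, leaving out id 9999 and comparing ids against a keyword filter, could look roughly like this. The data and the keyword "MILK" are invented for the sketch; the real keywords would have to be chosen by hand.

```python
import pandas as pd

DIRECT_ORDER_ID = 9999  # the id used for direct customer orders, as discussed above

# Invented rows for illustration
toy = pd.DataFrame({
    "product_id": [245, 245, 301, 9999, 9999],
    "description": ["MILK 1L", "SOY MILK", "BREAD", "MILK GIFT PACK", "CUSTOM"],
    "units_ordered_numeric": [3, 2, 1, 4, 0],
})

# Leave out the catch-all id before any per-product comparison
without_direct = toy[toy["product_id"] != DIRECT_ORDER_ID]

# Hypothetical manual filter: rows whose description contains the keyword
keyword = "MILK"
by_keyword = without_direct[
    without_direct["description"].str.contains(keyword, case=False, regex=False)
]

# If the id is a good indicator, a keyword should map to few ids
ids_for_keyword = sorted(by_keyword["product_id"].unique().tolist())
print(ids_for_keyword)
```

A keyword that maps to many unrelated ids, or an id that matches many unrelated keywords, would point to the kind of mismatch described in the list above.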



Considering that we have run this notebook over one million rows, the results, in terms of comparing the sales of the different products, should be relatively reliable.

This is why we export the last two dataframes.

In [None]:
# We can now store these results in a csv, to send them if convenient
# (bearing in mind that these results were not obtained from the full dataset):

totals_by_id.to_csv(file_path+total_sales_results_per_id, sep=sep)
totals_by_id_and_store.to_csv(file_path+total_sales_results_per_id_and_store, sep=sep)
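When a dataframe is written with `to_csv` without `index=False`, its index is saved as an extra unnamed column, which comes back as `Unnamed: 0` when the file is read again. The sketch below, on invented data and an in-memory buffer, shows a round trip that avoids this.

```python
import io
import pandas as pd

# Invented totals standing in for the exported dataframes
toy_totals = pd.DataFrame({"product_id": [245, 301], "total_orders": [10, 1]})

# Write to an in-memory buffer with the same separator, leaving the index out
buffer = io.StringIO()
toy_totals.to_csv(buffer, sep=";", index=False)

# Read it back and check that no extra index column appeared
buffer.seek(0)
restored = pd.read_csv(buffer, sep=";")
print(list(restored.columns))
```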