## Name: Md Mehedi Hasan
## Email: mehedi2003@gmail.com

In [1]:
# Required libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from scipy import stats

# Load data from the CSV file. DataFrame.from_csv was removed in pandas 1.0;
# read_csv with index_col=0 reproduces its default of using the first column as the index.
data = pd.read_csv('screening_exercise_orders_v201810.csv', index_col=0)
# Convert the date column from strings to datetimes
data['date'] = pd.to_datetime(data['date'])



In [2]:
# Group by customer (sorted by id): count orders and take the first order date, then display the first 10 rows
sorted_customers = data.groupby(['customer_id', 'gender'], sort=True).agg({'value': "count", 'date': 'first'})
print(sorted_customers.head(10))

                                  date  value
customer_id gender                           
1000        0      2017-01-01 00:11:31      1
1001        0      2017-01-01 00:29:56      1
1002        1      2017-01-01 01:30:31      3
1003        1      2017-01-01 01:34:22      4
1004        0      2017-01-01 03:11:54      1
1005        1      2017-01-01 10:08:05      2
1006        1      2017-01-01 15:42:57      3
1007        0      2017-01-01 15:59:50      1
1008        0      2017-01-01 18:01:04      3
1009        1      2017-01-01 19:27:17      1


In [3]:
# Plot order counts per week
ax = plt.gca()
data['week'] = data['date'].apply(lambda x: x.week)
orders_per_week = data.groupby(['week']).agg({'value': "count"})
orders_per_week.plot(kind='line', ax=ax)
plt.ylabel('# of orders')
plt.xlabel('week')
plt.show()

In [4]:
# Mean value by gender
mean_order_value = data.groupby(['gender']).agg({'value': "mean"})
print(mean_order_value)

             value
gender            
0       363.890006
1       350.708361


In [5]:
# Welch's t-test (equal_var=False) for a difference in mean order value between genders
t_test = stats.ttest_ind(data[data['gender'] == 0]['value'], data[data['gender'] == 1]['value'], equal_var=False)
print(t_test)

Ttest_indResult(statistic=1.976107933576866, pvalue=0.04816296295128402)


* Based on the t-test, the p-value (about 0.048) is just below 0.05. Therefore, the mean order values of the two genders differ significantly at the 5% level, although only marginally.
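
To make the test above more concrete, here is a minimal sketch of Welch's t-test (the `equal_var=False` variant used above). The two samples are synthetic, generated only for illustration; they are not the order data, and the names `group_0`, `group_1`, and `alpha` are assumptions of this sketch. It also recomputes the t statistic by hand to show what `stats.ttest_ind` reports.

```python
import numpy as np
from scipy import stats

# Synthetic order values (hypothetical, NOT the screening dataset)
rng = np.random.default_rng(0)
group_0 = rng.normal(loc=360.0, scale=400.0, size=500)
group_1 = rng.normal(loc=350.0, scale=300.0, size=700)

# Welch's t-test: does not assume equal variances in the two groups
result = stats.ttest_ind(group_0, group_1, equal_var=False)

# The same statistic by hand: difference in means over its standard error
mean_diff = group_0.mean() - group_1.mean()
std_err = np.sqrt(group_0.var(ddof=1) / group_0.size
                  + group_1.var(ddof=1) / group_1.size)
t_manual = mean_diff / std_err

# Compare the p-value against the chosen significance level
alpha = 0.05
is_significant = bool(result.pvalue < alpha)

print("t (scipy): ", result.statistic)
print("t (manual):", t_manual)
print("p-value:   ", result.pvalue, "significant:", is_significant)
```

A p-value close to the threshold, as in the result above, is sensitive to the choice of `alpha`, so it is worth reporting the p-value itself alongside the verdict.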

In [None]:
# Confusion matrix of actual vs. predicted gender (rows: actual, columns: predicted)
confusion_mtx = confusion_matrix(data['gender'], data['predicted_gender'])
print(confusion_mtx)

* The gender prediction model performed poorly, especially for gender 0: fewer than 50% of the actual gender 0 customers are classified correctly (more than 50% are false negatives, i.e. predicted as gender 1). Even a random guess would give similar results for gender 0. It also means that the model is slightly biased toward gender 1.
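
The per-class accuracy discussed above can be read off a confusion matrix as the diagonal divided by the row sums (the recall of each class). The sketch below uses a small hand-made set of labels, chosen only to illustrate the calculation; the names `y_true`, `y_pred`, and `recall_per_class` are assumptions, and the numbers are not the results for this dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only (NOT the screening dataset)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1, 0, 1]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Recall per class: correctly predicted count / number of actual members
recall_per_class = np.diag(cm) / cm.sum(axis=1)

# Share of all predictions that went to each class (a rough bias check)
predicted_share = cm.sum(axis=0) / cm.sum()

print(cm)
print("recall per class:", recall_per_class)
print("predicted share: ", predicted_share)
```

In this toy example the recall for class 0 is well below 50% while most predictions go to class 1, which is the pattern described above.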

* My preferred approach to a problem is to fit the best model I can, judged by whether its F1 score is acceptable by the usual standards for the task. For the given dataset, I tried to build a state-of-the-art machine learning model. Data visualization helped me understand the problem first. After that, cross-validated, macro-averaged F1 scores helped me assess the overall performance of the model. I also checked the strength of the model using the area under the ROC curve (AUC). I mostly used the pandas, scikit-learn, and NumPy packages in Python.
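
The evaluation workflow described above can be sketched as follows. This is only an illustration under stated assumptions: the data come from `make_classification` (synthetic, not the order data), and logistic regression with 5-fold cross-validation stands in for whatever model and split were actually used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (hypothetical stand-in dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Placeholder model; any scikit-learn classifier could be used here
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validated macro-averaged F1 score
f1_scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")

# 5-fold cross-validated area under the ROC curve
auc_scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")

print("macro F1 per fold:", f1_scores, "mean:", f1_scores.mean())
print("ROC AUC per fold: ", auc_scores, "mean:", auc_scores.mean())
```

Macro averaging weights both classes equally, so a model that does well on one gender and poorly on the other is penalized rather than hidden by overall accuracy.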