### Problem-2: 

#### Trade Forecasting
__Welcome to Antallagma__, a digital exchange for trading goods. Antallagma started operations 5 years ago and has supported more than a million transactions to date. The platform runs a traditional exchange through an online portal.

On one hand, buyers place a bid stating the price they are willing to pay (the "bid value") and the quantity they want to buy. Sellers, on the other hand, state an ask price and the quantity they are willing to sell. The portal matches buyers and sellers in real time to create trades. All trades are settled at the end of the day at the median price of all agreed trades.
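
The settlement rule can be illustrated with a small sketch. The trade-level table below is hypothetical (the notebook only has daily aggregates), and it assumes the median is unweighted and taken per item over one day's agreed trades.

```python
import pandas as pd

# Hypothetical agreed trades for a single day (not from the contest data).
trades = pd.DataFrame({
    "item_id": [1, 1, 1, 2, 2],
    "price": [10.0, 12.0, 11.0, 5.0, 7.0],
    "quantity": [2, 1, 3, 4, 1],
})

# Assumed rule: per item, settle at the unweighted median of agreed prices
# and report the total quantity traded.
daily = trades.groupby("item_id").agg(
    settlement_price=("price", "median"),
    volume=("quantity", "sum"),
)
print(daily)
```

Under these assumptions, `settlement_price` and `volume` play the roles of the dataset's `Price` and `Number_Of_Sales` columns.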

You are one of the traders on the exchange and can supply all the material being traded on the exchange. In order to improve your logistics, you want to predict the median trade prices and volumes for all the trades happening (at item level) on the exchange. You can then plan to use these predictions to create an optimized inventory strategy.

You are expected to create trade forecasts for all items being traded on Antallagma along with the trade prices for a period of 6 months.

__Evaluation Criteria:__
Overall Error = Lambda1 x RMSE of volumes + Lambda2 x RMSE of prices, where Lambda1 and Lambda2 are normalising parameters.
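
A minimal sketch of this metric follows. The problem statement does not give the values of Lambda1 and Lambda2, so the defaults of 1.0 here are placeholders, and the function names are illustrative.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def overall_error(vol_true, vol_pred, price_true, price_pred,
                  lambda1=1.0, lambda2=1.0):
    """Weighted sum of volume RMSE and price RMSE.

    lambda1 and lambda2 are assumed placeholders; the contest's actual
    normalising values are not stated in the problem description.
    """
    return (lambda1 * rmse(vol_true, vol_pred)
            + lambda2 * rmse(price_true, price_pred))

# Volumes are off by 3 everywhere, prices by 4 everywhere.
print(overall_error([0, 0], [3, 3], [0, 0], [4, 4], lambda1=2.0, lambda2=0.5))
```

Because volumes and prices are on different scales, the choice of the two weights decides which target dominates the score.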

__Description:__
* There are 1529 unique stocks in the data.
* The data runs from January 2014 to June 2016. Split it into train (January 2014 to December 2015) and test (January 2016 to June 2016). Build your models on train and report your final scores on test.
* `Category_1`, `Category_2`, and `Category_3` are masked features: binary, ordered, and unordered, respectively.
* `Price` (median sale price on that day) and `Number_Of_Sales` (total items sold on that day) are the two target variables.
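
The date-based split described above could look like the sketch below. It assumes a parsed datetime column named `DateTime_pd` (the name this notebook uses after converting `Datetime`); the helper `time_split` and the toy frame are illustrative only.

```python
import pandas as pd

def time_split(frame, date_col="DateTime_pd", cutoff="2016-01-01"):
    """Split chronologically: rows before `cutoff` train, the rest test."""
    cutoff = pd.Timestamp(cutoff)
    train = frame[frame[date_col] < cutoff]
    test = frame[frame[date_col] >= cutoff]
    return train, test

# Toy frame standing in for the real data.
toy = pd.DataFrame({
    "DateTime_pd": pd.to_datetime(
        ["2015-12-30", "2015-12-31", "2016-01-01", "2016-06-30"]),
    "Price": [10.0, 11.0, 12.0, 13.0],
})
train, test = time_split(toy)
print(len(train), len(test))
```

Unlike a random split, this keeps every test row strictly later than every train row, which matches the forecasting setup.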

Students can use both machine learning and data driven models to solve this problem.

__Important links:__
* https://github.com/Prakashvanapalli/av_july_2017
* https://datahaccurl
* http://localhost:8080/words/top/2k.analyticsvidhya.com/contest/fractal-analytics-hiring-hackathon/

In [27]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

%matplotlib inline

## EDA

In [2]:
df = pd.read_csv('train.csv', index_col=0)
df.head()

ID,Item_ID,Datetime,Category_3,Category_2,Category_1,Price,Number_Of_Sales
30495_20140101,30495,2014-01-01,0,2.0,90,165.123,1
30375_20140101,30375,2014-01-01,0,2.0,307,68.666,5
30011_20140101,30011,2014-01-01,0,3.0,67,253.314,2
30864_20140101,30864,2014-01-01,0,2.0,315,223.122,1
30780_20140101,30780,2014-01-01,1,2.0,132,28.75,1


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 881876 entries, 30495_20140101 to 30132_20160630
Data columns (total 7 columns):
Item_ID            881876 non-null int64
Datetime           881876 non-null object
Category_3         881876 non-null int64
Category_2         790263 non-null float64
Category_1         881876 non-null int64
Price              881876 non-null float64
Number_Of_Sales    881876 non-null int64
dtypes: float64(2), int64(4), object(1)
memory usage: 53.8+ MB


In [4]:
df['DateTime_pd'] = pd.to_datetime(df['Datetime'])

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 881876 entries, 30495_20140101 to 30132_20160630
Data columns (total 8 columns):
Item_ID            881876 non-null int64
Datetime           881876 non-null object
Category_3         881876 non-null int64
Category_2         790263 non-null float64
Category_1         881876 non-null int64
Price              881876 non-null float64
Number_Of_Sales    881876 non-null int64
DateTime_pd        881876 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(2), int64(4), object(1)
memory usage: 60.6+ MB
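
With `DateTime_pd` stored as `datetime64[ns]`, calendar features can be derived through the `.dt` accessor. The sketch below is one possible next step, not something this notebook does; the helper name and the new column names are assumptions.

```python
import pandas as pd

def add_date_features(frame, date_col="DateTime_pd"):
    """Return a copy with year, month and day-of-week columns added."""
    out = frame.copy()
    out["year"] = out[date_col].dt.year
    out["month"] = out[date_col].dt.month
    out["day_of_week"] = out[date_col].dt.dayofweek  # Monday=0 ... Sunday=6
    return out

# First and last dates in the dataset, used as a toy example.
toy = pd.DataFrame({"DateTime_pd": pd.to_datetime(["2014-01-01", "2016-06-30"])})
toy = add_date_features(toy)
print(toy)
```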


In [79]:
df['Category_2'].value_counts()

2.0    227166
3.0    212388
1.0    140098
4.0    106903
5.0    103708
Name: Category_2, dtype: int64
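
The counts above add up to 790263, not 881876, because `value_counts()` skips NaN by default; the remaining 91613 rows have no `Category_2`. The sketch below, on a toy column, shows how to surface the missing values and one alternative to dropping them. The sentinel value `-1` is an assumption, not something the notebook uses.

```python
import numpy as np
import pandas as pd

# Toy column standing in for df['Category_2'].
toy = pd.DataFrame({"Category_2": [2.0, np.nan, 3.0, np.nan, 1.0]})

# Count missing values explicitly.
n_missing = int(toy["Category_2"].isna().sum())

# Include NaN as its own bucket in the frequency table.
counts = toy["Category_2"].value_counts(dropna=False)

# Alternative to dropna(): keep the rows and mark them with a sentinel.
filled = toy["Category_2"].fillna(-1)

print(n_missing)
print(counts)
```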

In [80]:
ndf = df.dropna().copy()

In [82]:
ndf.drop(['Datetime'], axis=1, inplace=True)
ndf.info()

<class 'pandas.core.frame.DataFrame'>
Index: 790263 entries, 30495_20140101 to 30305_20160630
Data columns (total 7 columns):
Item_ID            790263 non-null int64
Category_3         790263 non-null int64
Category_2         790263 non-null float64
Category_1         790263 non-null int64
Price              790263 non-null float64
Number_Of_Sales    790263 non-null int64
DateTime_pd        790263 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(2), int64(4)
memory usage: 48.2+ MB


## ML Functions

In [64]:
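def do_ml(df, features, target, model, test_size=0.01):
    """Fit `model` on a random train/test split and return (y_test, y_pred)."""
    X = df[features].copy()
    y = df[target].copy()
    # Random (not chronological) split; fixed seed for reproducibility.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42)
    # `model` is an estimator class that must accept n_jobs (LinearRegression does).
    model = model(n_jobs=-1)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return y_test, y_pred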

def metrics(y_test, y_pred):
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    return r2, mse, rmse

def visualize(y_test, y_pred):
    # figsize must be passed by keyword; plt.figure(12, 8) raises an error.
    plt.figure(figsize=(12, 8))
    plt.scatter(y_test, y_pred)

### Model Execution

In [83]:
features = ['Category_1', 'Category_2', 'Category_3', 'Price']
target = 'Number_Of_Sales'

In [84]:
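# Use ndf (rows with NaN dropped). Passing df fails with
# "ValueError: Input contains NaN, infinity or a value too large for dtype('float64')."
# because Category_2 has missing values.
y_test, y_pred = do_ml(ndf, features, target, LinearRegression, test_size=0.1)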

In [92]:
r2, mse, rmse = metrics(y_test, y_pred)
print('R2 Score = %f' %r2) 
print('Root Mean Square Error = %f' %rmse)

R2 Score = 0.011871
Root Mean Square Error = 7437.914980
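
An R2 of about 0.012 means the model explains roughly 1% of the variance of `Number_Of_Sales` on the held-out rows. Since R2 = 1 - MSE / Var(y_test), the reported RMSE is about 0.994 times the standard deviation of `y_test`, so the model is barely better than always predicting the mean. The sketch below checks that relationship on synthetic data (the exponential sample is an assumption chosen only for illustration).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic skewed target, standing in for Number_Of_Sales.
rng = np.random.default_rng(0)
y_true = rng.exponential(scale=100.0, size=1000)

# Baseline: predict the mean of the evaluated values for every row.
y_mean = np.full_like(y_true, y_true.mean())

r2_base = r2_score(y_true, y_mean)
rmse_base = np.sqrt(mean_squared_error(y_true, y_mean))

# The mean baseline has R2 = 0 and RMSE equal to the standard deviation.
print(r2_base, rmse_base, y_true.std())
```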


In [94]:
df['Datetime'].head()

ID
30495_20140101    2014-01-01
30375_20140101    2014-01-01
30011_20140101    2014-01-01
30864_20140101    2014-01-01
30780_20140101    2014-01-01
Name: Datetime, dtype: object

In [98]:
# pd.Timestamp() converts a single scalar, so passing the whole Series raises
# "TypeError: Cannot convert input [...] of type <class 'pandas.core.series.Series'> to Timestamp".
# The column is already datetime64[ns] (see In [4]); convert one element instead.
pd.Timestamp(df['DateTime_pd'].iloc[0])  # Timestamp('2014-01-01 00:00:00')