* ### **Importing Libraries**
  Add libraries as and when needed in the analysis.

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

* ### **Reading the inputs**
  We can read the given training and testing datasets using `read_csv` from Pandas.
  

In [None]:
train_file_path = '../input/amazon-employee-access-challenge/train.csv'
train_df = pd.read_csv(train_file_path)

test_file_path = '../input/amazon-employee-access-challenge/test.csv'
test_df = pd.read_csv(test_file_path)

* ### **Understanding the data roughly**
    `info()`, `head()`, `tail()`, `shape` and `describe()` can be used.  
     Handle the null values (if any).
     
   

In [None]:
train_df.info()

The only dtype is `int64` (no string-valued categorical data), so we don't need to create dummy columns to map strings to numbers (`get_dummies` can do this if needed). Since every column is numeric, `describe()` summarises all of our columns.
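For reference, here is a minimal sketch of what `get_dummies` would do if a string column were present. The toy frame and its column names (`dept`, `level`) are made up for illustration and are not part of the competition data.

```python
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only.
toy_df = pd.DataFrame({"dept": ["sales", "eng", "sales"], "level": [1, 2, 3]})

# One indicator column is created per distinct string value in "dept";
# the numeric column "level" is passed through unchanged.
encoded_df = pd.get_dummies(toy_df, columns=["dept"])
print(encoded_df.columns.tolist())
```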

**(Not needed for our dataset)**

The `ColumnTransformer` class from `sklearn.compose` is usually combined with a pipeline to automate two things:
* imputing missing values in numerical data  
* imputing missing values in categorical data and applying a suitable encoding to it
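A minimal sketch of such a preprocessor, assuming a toy frame with one numeric column (`age`) and one string column (`dept`). Both names and the chosen imputation strategies are illustrative assumptions, not something this dataset needs.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with one missing value in each column.
toy_df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "dept": ["sales", np.nan, "eng"],
})

# Numerical columns: fill missing values with the column median.
numeric_step = SimpleImputer(strategy="median")

# Categorical columns: fill with the most frequent value, then one-hot encode.
categorical_step = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_step, ["age"]),
    ("cat", categorical_step, ["dept"]),
])

transformed = preprocessor.fit_transform(toy_df)
print(transformed.shape)
```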

In [None]:
train_df.shape 

32769 rows; 9 features plus the `ACTION` label

In [None]:
train_df.head()

In [None]:
train_df.describe()

We see that the 'Non-Null Count' reported by `info()` equals the number of rows in our training data for every column, so there are no null values. We can confirm this with `.isnull().values.any()`; if it returns `True`, we can fill the missing fields appropriately using `fillna`. `isnull().sum()` gives the number of null values in each column.

In [None]:
train_df.isnull().values.any()
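If the check above had returned `True`, the per-column counts and a fill could look like the sketch below. The toy frame and the choice of the column median as the fill value are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a single missing value in column "a".
toy_df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# Number of nulls per column.
null_counts = toy_df.isnull().sum()
print(null_counts)

# Fill each column's missing entries with that column's median.
filled_df = toy_df.fillna(toy_df.median())
print(filled_df.isnull().values.any())
```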

* ### **Understanding the dataset using graphs**
  Plot using **Matplotlib**, **Seaborn** and try to infer.  
  Decide if any feature is **irrelevant** and is less likely to contribute to the outcome.   
  Check if any of the features are **correlated**.   
  

In [None]:
# cols = ['ACTION','RESOURCE','MGR_ID','ROLE_ROLLUP_1','ROLE_ROLLUP_2','ROLE_DEPTNAME','ROLE_TITLE','ROLE_FAMILY_DESC','ROLE_FAMILY','ROLE_CODE']
cols= train_df.columns
for i in cols:
    train_df.hist(i)

    

For almost all of the features, the values are concentrated in a few narrow ranges.

In [None]:
for p in cols:
    n = train_df[p].nunique()
    print(p, n)

`ROLE_FAMILY`, `ROLE_ROLLUP_1` and `ROLE_ROLLUP_2` have relatively few unique values, so we plot only these against `ACTION`.

In [None]:
a = ['ROLE_FAMILY', 'ROLE_ROLLUP_1', 'ROLE_ROLLUP_2']

# Plot the mean ACTION for each distinct value of the feature
for j in a:
    group = train_df.groupby(j)
    plt.figure(figsize=(20, 8))
    plt.plot(group['ACTION'].mean(), 'ro')
    plt.xlabel(j)
    plt.ylabel('Mean ACTION')
    plt.show()

# Earlier attempt: a plain scatterplot of ACTION against the first feature
plt.figure(figsize=(20, 8))
sns.scatterplot(x=train_df[a[0]], y=train_df['ACTION'])

Now, we check if any of the features are correlated (`corr()` can be used).

In [None]:
print(train_df.corr())
sns.heatmap(train_df.corr())

Away from the diagonal (i != j), the heatmap shows no light areas, so there is no major correlation between distinct features in the dataset.  
Some of the features that are slightly related:  
`ROLE_TITLE` & `ROLE_CODE`   
`ROLE_TITLE` & `ROLE_FAMILY_DESC`
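Rather than reading pairs off the heatmap, the strongest off-diagonal correlations can be listed programmatically. The helper `top_correlated_pairs` below is a hypothetical name introduced for this sketch, and it is demonstrated on synthetic data rather than on `train_df`.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=3):
    """Return the n feature pairs with the largest absolute correlation."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (a feature with itself) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Synthetic example: "x_noisy" is built from "x", "z" is independent.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy_df = pd.DataFrame({
    "x": x,
    "x_noisy": x + rng.normal(scale=0.1, size=200),
    "z": rng.normal(size=200),
})
top = top_correlated_pairs(toy_df, n=1)
print(top)
```

On the real data, `top_correlated_pairs(train_df)` would be the corresponding call.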

Let's check exact values of correlation with our label 'ACTION'.


In [None]:
train_df.corr()['ACTION'].sort_values(ascending=False)

In [None]:
train_df['ROLE_ROLLUP_1'].value_counts().sort_values(ascending=False)

Some `ROLE_ROLLUP_1` values occur far more often than others, so we don't drop this column.

In [None]:
train_df['ROLE_TITLE'].value_counts().sort_values(ascending=False)

The feature **'ROLE_TITLE'** would probably lead to target leakage. A title is awarded after an employee has worked at the company for a few years, whereas resources are provided when the employee joins, so this column can't plausibly contribute to `ACTION`. Therefore, we **drop this column** from both datasets.

In [None]:
train_df = train_df.drop(columns='ROLE_TITLE')
test_df = test_df.drop(columns='ROLE_TITLE')

# train_df.head() # to see if it has been dropped successfully

In [None]:
train_df['MGR_ID'].value_counts().sort_values(ascending=False)

We don't drop `MGR_ID`, for the same reason as `ROLE_ROLLUP_1`.

* ### **Splitting the Training Dataset for Validation**
    We hold out part of the training data so that we can try different models and tune their parameters against it.
   

In [None]:
from sklearn.model_selection import train_test_split
y=train_df['ACTION']
X=train_df.drop('ACTION',axis=1)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.80,test_size=0.20, random_state=7)

# X_train.head() 
# y_train.head()

* ### **Choosing a model for our problem**
   The plan was to start with XGBoost and random forests, then try other models (for the best fit) if time permitted. I ran out of time, so I have used only **XGBoost**.  
   
   I learned and implemented the model from [this article](https://www.kaggle.com/alexisbcook/xgboost).

In [None]:
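from xgboost import XGBRegressor

# Recent xgboost releases (2.0 and later) take early_stopping_rounds in the
# constructor; older releases expect it as an argument to fit() instead.
XGB_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, early_stopping_rounds=5)
XGB_model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)

# The regressor returns continuous scores, which AUC can rank directly
XGB_predictions = XGB_model.predict(X_valid)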


**Note on the parameters**  

 `n_estimators` specifies how many times to go through the modeling cycle (the heart of the model), i.e. the maximum number of models included in the ensemble. Training stops once this number is reached, at the latest. Too low a value causes ***underfitting***, and too high a value causes ***overfitting***.
 
 `early_stopping_rounds` offers a way to find a good value for `n_estimators` automatically. It is set to 5 here, so training stops once the validation score has failed to improve for 5 consecutive rounds. This let me set `n_estimators` high without worrying about overfitting.
 
 `learning_rate` scales down the predictions of each sub-model before they are added to the ensemble, so each added model helps us less. This makes it possible to use a high `n_estimators` without overfitting.
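The stopping rule can be sketched in plain Python. This is a simplified illustration of the idea, not XGBoost's actual implementation; the function name `best_round_with_early_stopping` and the list of validation losses are made up for the example.

```python
def best_round_with_early_stopping(val_losses, patience):
    """Return (best_round, stop_round) for a sequence of validation losses.

    Training stops as soon as `patience` rounds have passed without the
    loss improving on the best value seen so far (lower is better).
    """
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_round = round_idx
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    # The budget (the analogue of n_estimators) ran out before stopping.
    return best_round, len(val_losses) - 1

# Hypothetical validation losses, one per boosting round.
losses = [0.50, 0.42, 0.40, 0.41, 0.43, 0.405, 0.44, 0.45, 0.39]
best, stopped = best_round_with_early_stopping(losses, patience=5)
print(best, stopped)
```

Here the last value (0.39) is never reached: a small patience saves time but can miss a late improvement, which is the trade-off when choosing `early_stopping_rounds`.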
 
 



In [None]:
# To validate our model
import sklearn.metrics 

auc = sklearn.metrics.roc_auc_score(y_valid, XGB_predictions)
print(auc)

I arrived at the parameter values above by varying them and checking the validation AUC.
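AUC depends only on how the scores rank the positive rows relative to the negative ones, which is why continuous regressor outputs can be passed to `roc_auc_score` directly. A small worked example on made-up labels and scores, cross-checked against the pairwise definition:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and scores for illustration.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(y_true, scores)

# Pairwise definition: the fraction of (positive, negative) pairs in which
# the positive row gets the higher score (ties count as half).
pos = scores[y_true == 1]
neg = scores[y_true == 0]
pairwise = np.mean([float(p > n) + 0.5 * float(p == n) for p in pos for n in neg])

print(auc, pairwise)
```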

* ### **Cross Validation**
  The `cross_val_score()` function from `scikit-learn` can generate the required folds and score each one for you.  
  With ~32k rows, a single random 80/20 split should already give a reasonably stable estimate, so there is little harm in skipping this step.   
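If you do want to run it, a minimal sketch follows. It uses synthetic data from `make_classification` and scikit-learn's `GradientBoostingClassifier` as a stand-in for the XGBoost model, so the data, model and settings are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data standing in for the real features.
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=7)

# Stand-in gradient boosting model; parameter values are illustrative only.
model = GradientBoostingClassifier(n_estimators=50, learning_rate=0.05, random_state=7)

# 5-fold cross-validation, scoring each held-out fold by AUC.
scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="roc_auc")
print(scores.mean())
```

The spread of the five scores gives a rough idea of how much a single validation split could vary.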
  

* ### **Applying the model on the test dataset**

In [None]:
test_df.head()

In [None]:
X_test_final=test_df.drop('id',axis=1)
XCB_final_preds = XGB_model.predict(X_test_final)

* ### **Submission**  

In [None]:
# Take the ids from the test file itself rather than rebuilding them from the index
XCB_output = pd.DataFrame({'Id': test_df['id'],
                           'ACTION': XCB_final_preds})
XCB_output.to_csv('XGB_submission.csv', index=False)

XCB_output.head()