### Business Context

The goal is to build a text classification model using a dataset of what corporations actually talk about on social media. Each statement is labelled with one of the following categories: `information` (objective statements about the company or its activities), `dialog` (replies to users, etc.), or `action` (messages that ask for votes or ask users to click on links, etc.). Our aim is to build a model that automatically assigns each text to its category. You can download the dataset from [here](https://data.world/crowdflower/corporate-messaging).

### Task 1: Understanding and loading the dataset

In [None]:
# load required libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# load the dataset

data 

In [None]:
# see head of the dataset



In [None]:
# observe shape of the dataset



In [None]:
# check distribution of target column i.e. category



In [None]:
# check distribution of the column - category_confidence



In [None]:
# remove the observations where category_confidence < 1 or category is 'Exclude'



In [None]:
# extract the features (the column 'text') and the target (the column 'category')



### Task 2: Text preprocessing

In [None]:
# let's observe a text in the dataset, extract the first text



In [None]:
# now extract the third text from this dataset



We will apply the following preprocessing steps to the text:
- tokenize the sentences
- replace the URLs with a placeholder
- remove non-ASCII characters
- normalize the text using lemmatization
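
The two regex-based steps can be sketched with the standard library alone. This is a minimal sketch, not the finished `tokenize` function: the helper name `clean_text` and the sample string are made up for illustration, and tokenization, stopword removal, and lemmatization with NLTK are left out.

```python
import re

# same patterns as used in this notebook
url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
non_ascii_regex = r'[^\x00-\x7F]+'


def clean_text(text):
    """Hypothetical helper: apply only the two regex substitutions."""
    # replace every URL with the token 'urlplaceholder'
    text = re.sub(url_regex, 'urlplaceholder', text)
    # replace every run of non-ASCII characters with a single space
    text = re.sub(non_ascii_regex, ' ', text)
    return text


# made-up sample containing one non-ASCII character and one URL
sample = 'Caf\u00e9 menu: http://example.com/menu see you'
cleaned = clean_text(sample)
print(cleaned.split())
```

Replacing URLs before tokenizing keeps each link as a single token instead of letting the tokenizer split it into fragments.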

In [None]:
# import re library for regular expressions


# import nltk library


# import stopwords from nltk library


# download the stopwords and wordnet corpus
nltk.download('stopwords')
nltk.download('wordnet')


# extract the english stopwords and save it to a variable


# import word_tokenize from nltk library


# import WordNetLemmatizer from nltk library


# write a regular expression to identify urls in text
url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# write a regular expression to identify non-ascii characters in text
non_ascii_regex = r'[^\x00-\x7F]+'

# write a function to tokenize text after performing preprocessing 
def tokenize(text):
    
    # use library re to replace urls by token - urlplaceholder
    
    
    # use library re to replace non ascii characters by a space
      

    # use word_tokenize to tokenize the sentences
    
    
    # instantiate an object of class WordNetLemmatizer
    

    # use a list comprehension to lemmatize the tokens and remove the stopwords
    

    # return the tokens
    return 

### Task 3: EDA

In this task, we will do exploratory data analysis to check whether we can generate any new features from the existing text in the dataset.

**Hypothesis 1:** The length of the text might differ between categories
<br>**Hypothesis 2:** The number of URLs present in the text might differ between categories
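
Both hypotheses can be checked with two derived columns. The sketch below uses a tiny made-up DataFrame named `toy` rather than the real dataset, so the texts and counts are purely illustrative; the column names `length` and `url_count` follow the cells below.

```python
import re

import pandas as pd

url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# made-up rows standing in for the real dataset
toy = pd.DataFrame({
    'text': [
        'Vote now at http://example.com/vote',
        'Thanks for reaching out!',
        'Q3 results: http://example.com/a and http://example.com/b',
    ],
    'category': ['action', 'dialog', 'information'],
})

# hypothesis 1: number of characters in each text
toy['length'] = toy['text'].str.len()

# hypothesis 2: number of URL matches in each text
toy['url_count'] = toy['text'].apply(lambda t: len(re.findall(url_regex, t)))

# rows are categories, columns are the distinct URL counts
summary = pd.crosstab(toy['category'], toy['url_count'])
print(summary)
```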

In [None]:
# create a new column in the original dataset - 'length' to capture length of each text


# use seaborn boxplot to visualize the pattern in length for each category


In [None]:
# create a new column in the original dataset - 'url_count' to capture total count of urls present in each text


# use pandas crosstab to see the distribution of different url counts in each category


### Task 4: Creating custom transformers

An estimator is any object that learns from data, whether it's a classification, regression, or clustering algorithm, or a transformer that extracts or filters useful features from raw data. Since estimators learn from data, each of them must have a `fit` method that takes a dataset.

There are two kinds of estimators: transformer estimators (transformers for short) and predictor estimators (predictors for short). In addition to `fit`, a transformer must implement a `transform` method and a predictor must implement a `predict` method.

Some examples of transformers are `CountVectorizer`, `TfidfVectorizer`, `MinMaxScaler` and `StandardScaler`.

Some examples of predictors are `LinearRegression`, `LogisticRegression` and `RandomForestClassifier`.
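
One possible shape for the two custom transformers is sketched below. It assumes the common scikit-learn pattern of subclassing `BaseEstimator` and `TransformerMixin`, where `fit` learns nothing and `transform` returns one numeric column. The sample texts are made up, and measuring length in characters is an assumption; counting tokens would work the same way.

```python
import re

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'


class LengthExtractor(BaseEstimator, TransformerMixin):
    """Map each text to its length in characters."""

    def fit(self, X, y=None):
        # nothing to learn; return self so the object can be chained
        return self

    def transform(self, X):
        # one row per text, a single column holding the length
        return np.array([len(text) for text in X]).reshape(-1, 1)


class UrlCounter(BaseEstimator, TransformerMixin):
    """Map each text to the number of URLs it contains."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        counts = [len(re.findall(URL_REGEX, text)) for text in X]
        return np.array(counts).reshape(-1, 1)


# made-up sample texts
texts = ['See http://example.com/a now', 'No links here']
lengths = LengthExtractor().fit_transform(texts)
url_counts = UrlCounter().fit_transform(texts)
print(lengths.ravel(), url_counts.ravel())
```

Returning a two-dimensional array of shape `(n_samples, 1)` matters because the output is later stacked column-wise with other feature blocks.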

In [None]:
# create a custom transformer LengthExtractor to extract the length of each sentence



In [None]:
# create a custom transformer UrlCounter to count the number of urls in each sentence



### Task 5: Model Building using FeatureUnion

Feature union applies a list of transformer objects in parallel to the input data, then concatenates the results. This is useful to combine several feature extraction mechanisms into a single transformer.

![alt text](pipeline.png "nlp pipeline")
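
The parallel-then-concatenate behaviour can be seen on a small scale. The sketch below is an assumption-laden miniature of the pipeline in the figure: it trains on six made-up texts, defines a compact `LengthExtractor` inline so the block is self-contained, and leaves out the URL counter, which would be added to the union in the same way.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline


class LengthExtractor(BaseEstimator, TransformerMixin):
    """Map each text to its length in characters."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([len(text) for text in X]).reshape(-1, 1)


# the two branches run on the same input; their outputs are stacked column-wise
features = FeatureUnion([
    ('text_pipeline', Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
    ])),
    ('length', LengthExtractor()),
])

pipeline = Pipeline([
    ('features', features),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=0)),
])

# made-up training data, two texts per category
texts = [
    'Vote for us today', 'Click the link to enter',
    'Thanks for your reply', 'Sorry to hear that, please message us',
    'We opened a new office', 'Our quarterly report is out',
]
labels = ['action', 'action', 'dialog', 'dialog', 'information', 'information']

pipeline.fit(texts, labels)
predictions = pipeline.predict(texts)

# TF-IDF columns plus one extra column for the length
combined = features.fit_transform(texts)
print(combined.shape)
```

The combined matrix has one more column than the vocabulary size, which is the length feature appended to the TF-IDF block.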

In [None]:
# import RandomForestClassifier from sklearn


# import Pipeline and FeatureUnion from sklearn


# import CountVectorizer, TfidfTransformer from sklearn


In [None]:
# create an instance of Pipeline class

    
        # create a FeatureUnion pipeline
        

            # add a pipeline element to extract features using CountVectorizer and TfidfTransformer
            

            # add the pipeline element - LengthExtractor to extract length of each sentence as feature
            
            
            # add another pipeline element - UrlCounter to extract url counts in each sentence as feature
            

        # use the predictor estimator RandomForestClassifier to train the model
        


In [None]:
# split the data into train and test sets



In [None]:
# use pipeline.fit method to train the model



### Task 6: Model Evaluation

Now that the model is trained, we will evaluate how it performs on the test data.
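
To show how the two evaluation functions read, here is a minimal sketch on hand-written labels. The `y_test` and `y_pred` lists are made up and do not come from the trained model; in the confusion matrix, rows are the true labels and columns are the predicted labels.

```python
from sklearn.metrics import classification_report, confusion_matrix

# made-up true and predicted labels
y_test = ['action', 'dialog', 'information', 'information', 'action']
y_pred = ['action', 'information', 'information', 'information', 'dialog']

# fix the label order so rows and columns are easy to interpret
labels = ['action', 'dialog', 'information']

cm = confusion_matrix(y_test, y_pred, labels=labels)
print(cm)

# per-class precision, recall and F1 score
report = classification_report(y_test, y_pred, labels=labels, zero_division=0)
print(report)
```

Passing the same `labels` list as the tick labels of the heatmap keeps the plot axes aligned with the matrix.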

In [None]:
# use the method pipeline.predict on X_test data to predict the labels



In [None]:
# create the confusion matrix, import confusion_matrix from sklearn


# count the number of labels


# use sns.heatmap on top of confusion_matrix to show the confusion matrix


In [None]:
# create the classification report, import classification_report from sklearn


# apply the function classification_report on y_test, y_pred and print it


### Task 7: Conclusion and next steps

Ways to improve this model:

- hyperparameter tuning
- more feature engineering
- feature selection
- trying different predictors
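
As a starting point for the first item, hyperparameter tuning can be sketched with `GridSearchCV`. This is only an illustration under several assumptions: the six texts are made up, the pipeline is simplified to `TfidfVectorizer` plus `RandomForestClassifier`, and the grid values are arbitrary. Step names followed by a double underscore address the parameters of each pipeline step.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# made-up data, two texts per category so that 2-fold stratified CV is possible
texts = [
    'Vote for us today', 'Click the link to enter',
    'Thanks for your reply', 'Sorry to hear that, please message us',
    'We opened a new office', 'Our quarterly report is out',
]
labels = ['action', 'action', 'dialog', 'dialog', 'information', 'information']

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', RandomForestClassifier(random_state=0)),
])

# arbitrary illustrative grid: '<step name>__<parameter name>'
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'clf__n_estimators': [10, 25],
}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)
print(search.best_params_)
```

On the real dataset, a larger `cv` value and a wider grid would be more appropriate than the toy settings used here.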