<a href="https://colab.research.google.com/github/monicafar147/classification-predict-streamlit-template/blob/Preprocessing/preprocessing_comet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction to Comet ML  

Comet is a tool for model versioning and experiment tracking: it records the parameters and conditions of each of your experiments, allowing you to reproduce your results or go back to a previous version of an experiment.

To create an account, visit https://www.comet.ml/  
Follow the instructions for a single user account. Once that is created, you will see a project folder. That is where the records of your experiments can be viewed. 

Comet has an abundance of tutorials and scripts; this notebook just runs through the basics to get you started on the right track. For this illustration, we will be using one of the examples found on the Comet ML GitHub repo.

To begin with, install `comet_ml` as illustrated below if you don't already have it. *Always import Experiment at the top of your notebook/script*, before libraries such as sklearn, so that Comet can hook into them.


In [1]:
!pip install comet_ml



When you click on an experiment, you will see an API key button at the top of the page. Use this key as illustrated below to link your current workspace to Comet. (If a project is empty, the project page autogenerates the code below for you; just copy and paste it in here.)

In [2]:
# import comet_ml at the top of your file
from comet_ml import Experiment
    
# Add the following code anywhere in your machine learning file
experiment = Experiment(api_key="rBqQ3hDuEa6xVpT9ns5Tz1dVt",
                        project_name="nlp-climate-change", workspace="monicafar147")

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/monicafar147/nlp-climate-change/e94f19ab196e422e800b324adba6ec81
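
Hard-coding the key, as in the cell above, exposes it to anyone who can read the notebook. A minimal sketch of one alternative, assuming the key has been exported as the `COMET_API_KEY` environment variable (which `comet_ml` also picks up on its own when `api_key` is omitted). The helper name `comet_settings` is hypothetical and not part of the Comet API:

```python
import os

def comet_settings(project_name, workspace, env=None):
    """Build keyword arguments for Experiment(...) without embedding the key.

    Hypothetical helper: reads the key from COMET_API_KEY and fails early
    with a readable message when the variable is missing.
    """
    env = os.environ if env is None else env
    api_key = env.get("COMET_API_KEY")
    if not api_key:
        raise RuntimeError("Set COMET_API_KEY before creating an experiment")
    return {"api_key": api_key, "project_name": project_name, "workspace": workspace}

# Usage (same project and workspace as in the cell above):
# from comet_ml import Experiment
# experiment = Experiment(**comet_settings("nlp-climate-change", "monicafar147"))
```

Passing `env` explicitly is only there to make the helper easy to try out without touching the real environment.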



Import the rest of your libraries as you usually would. For this demonstration we will be classifying the sentiment of climate change tweets, so we also import text-preprocessing tools and a linear SVM pipeline from sklearn.

In [3]:
import numpy as np 
import pandas as pd

# plotting
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-deep')

# text preprocessing
import re
import string
import contractions
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from textblob import Word
from wordcloud import WordCloud, STOPWORDS
from string import punctuation 

# models
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

nltk.download('stopwords')



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
train = pd.read_csv("https://raw.githubusercontent.com/monicafar147/classification-predict-streamlit-template/master/climate-change-belief-analysis/train.csv")

In [5]:
train['message']

0        PolySciMajor EPA chief doesn't think carbon di...
1        It's not like we lack evidence of anthropogeni...
2        RT @RawStory: Researchers say we have three ye...
3        #TodayinMaker# WIRED : 2016 was a pivotal year...
4        RT @SoyNovioDeTodas: It's 2016, and a racist, ...
                               ...                        
15814    RT @ezlusztig: They took down the material on ...
15815    RT @washingtonpost: How climate change could b...
15816    notiven: RT: nytimesworld :What does Trump act...
15817    RT @sara8smiles: Hey liberals the climate chan...
15818    RT @Chet_Cannon: .@kurteichenwald's 'climate c...
Name: message, Length: 15819, dtype: object

In [6]:
def _preprocess(data):
  # work on a copy so the caller's DataFrame is left untouched
  df = data.copy()

  # apply lowercase to data
  df['message'] = df['message'].str.lower()

  # function to expand contractions, e.g. "doesn't" -> "does not"
  def remove_contraction(row):
    fixed = [contractions.fix(word) for word in row.split()]
    return ' '.join(map(str,fixed))

  # replace contractions
  df['message'] = df['message'].apply(remove_contraction)

  # function to remove every match of a regex pattern
  def remove_pattern(text,pattern,replacement=''):
    return re.sub(pattern, replacement, text)

  # remove URL (drops everything from the first https:// onwards)
  df['message'] = df['message'].apply(lambda text: re.split(r'https://.*', str(text))[0])

  # remove punctuation
  df['message'] = df['message'].apply(lambda text: text.translate(str.maketrans('', '', string.punctuation)))

  # remove stopwords
  stop_words = set(stopwords.words('english'))
  df['message'] = df['message'].apply(lambda text: ' '.join(word for word in text.split() if word not in stop_words))

  # remove the retweet marker (the text is lowercase by now, so match "rt")
  df['message'] = df['message'].apply(lambda text: remove_pattern(text, r'\brt\b\s*'))
  return df

In [7]:
def _preprocess_V2(tweet):
    stopwords_list = set(stopwords.words('english') + list(punctuation))
    tweet = tweet.lower() # convert text to lower-case
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet) # replace URLs with the token URL
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet) # remove the # in #hashtag
    tweet = re.sub(r"\W", " ", tweet) # replace remaining non-word characters (e.g. the @ in usernames) with spaces
    tweet = word_tokenize(tweet) # split into tokens (needs NLTK's punkt tokenizer data)
    tweets = [word for word in tweet if word not in stopwords_list]
    return " ".join(tweets)

In [8]:
def _preprocess_V3(tweet):
  tweet = tweet.lower()
  tweet = re.sub(r'#([^\s]+)', r'\1', tweet) # remove the # in #hashtag
  tweet = re.sub(r"\W", " ", tweet) # replace non-word characters with spaces
  tweet = word_tokenize(tweet)
  stopwords_list = set(stopwords.words('english') + list(punctuation))
  tweets = [word for word in tweet if word not in stopwords_list]
  # return the filtered tokens (joining `tweet` would put the stopwords back)
  return " ".join(tweets)

In [9]:
def _preprocess_V4(df):
  df = df.copy() # leave the caller's DataFrame unchanged
  def remove_punctuation(post):
    return ''.join([l for l in post if l not in string.punctuation])
  # NOTE: punctuation is stripped first, so "https://..." has already become
  # "httpstco..." by the time the URL pattern below is applied
  df['message'] = df['message'].apply(remove_punctuation)
  df_url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
  subs_url = r'url-web'
  df['message'] = df['message'].replace(to_replace = df_url, value = subs_url, regex = True)
  df['message'] = df['message'].str.lower()
  return df

In [10]:
# Splitting the labels and features
train_processed = _preprocess_V4(train)
X = train_processed['message']
y = train['sentiment']

In [11]:
X[0]

'polyscimajor epa chief doesnt think carbon dioxide is main cause of global warming and wait what httpstcoyelvcefxkc via mashable'
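
The URL in the output above survives as `httpstcoyelvcefxkc`, which suggests that punctuation was stripped before the URL pattern had a chance to match. A minimal sketch of the reordered steps on a single string, using only the standard library. The function name `clean_tweet`, the simplified URL pattern, and the sample tweet are illustrative assumptions, not part of the original pipeline:

```python
import re
import string

# Simplified URL pattern (an assumption): scheme followed by non-space characters
URL_PATTERN = re.compile(r'https?://\S+')

def clean_tweet(text, url_token='urlweb'):
    """Replace URLs first, then strip punctuation, then lowercase."""
    text = URL_PATTERN.sub(url_token, text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text.lower()

# Made-up tweet in the style of the training data
sample = "EPA chief doesn't think CO2 is the main cause https://t.co/abc123 via @mashable"
print(clean_tweet(sample))
# epa chief doesnt think co2 is the main cause urlweb via mashable
```

Switching the notebook to this order would change the features fed to the model, so the scores reported further down would no longer apply as-is.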

Split your data into train and test sets. Keep in mind that you need to set a random state for your results to be reproducible!

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1,random_state=42)
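
To see why the fixed `random_state` matters, here is a standard-library stand-in for a seeded split. It is only an illustration of the idea, not scikit-learn's implementation, and `seeded_split` is a hypothetical name:

```python
import random

def seeded_split(items, test_size=0.1, seed=42):
    """Shuffle a copy of items with a fixed seed, then cut off a test slice."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20))
train_a, test_a = seeded_split(rows, seed=42)
train_b, test_b = seeded_split(rows, seed=42)
print(test_a == test_b)  # True: same seed, same split
```

Without a fixed seed, each run may hold out different rows, so metrics logged to Comet from separate runs would not be directly comparable.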

In [13]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

In [14]:
# apply model on train data using Linear SVC:
svc = Pipeline([('tfidf',TfidfVectorizer()),('classify',LinearSVC())])
svc.fit(X_train, y_train)

#apply model on test data
y_pred = svc.predict(X_test)

## Results

Now that our model has trained, we can have a look at the results. Below is a confusion matrix (rows are true classes, columns are predicted classes). Most test tweets fall on the diagonal, so at first glance we have a fairly good model going. We then save the classification report, which contains the precision, recall, and F1 score for each class, for logging.

P.S. Have a look at the Comet tutorial page for interesting confusion matrix plots.

In [15]:
from sklearn.metrics import classification_report, confusion_matrix
print("\nResults\nConfusion matrix \n {}".format(
    confusion_matrix(y_test, y_pred)))


Results
Confusion matrix 
 [[ 63  16  42   5]
 [ 12  80 109  23]
 [ 15  39 781  60]
 [  3   8  66 260]]
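
The per-class scores can be derived from this matrix by hand, which is a useful sanity check. A small sketch in plain Python, assuming the rows and columns follow the sorted label order -1, 0, 1, 2 (the variable names are illustrative):

```python
# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels (scikit-learn's convention).
labels = [-1, 0, 1, 2]
cm = [
    [63, 16, 42, 5],
    [12, 80, 109, 23],
    [15, 39, 781, 60],
    [3, 8, 66, 260],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(labels)))
accuracy = correct / total

per_class = {}
for i, label in enumerate(labels):
    predicted_as_label = sum(row[i] for row in cm)  # column sum
    actually_label = sum(cm[i])                     # row sum
    precision = cm[i][i] / predicted_as_label
    recall = cm[i][i] / actually_label
    f1 = 2 * precision * recall / (precision + recall)
    per_class[label] = (precision, recall, f1)

print(f"accuracy: {accuracy:.2f}")  # accuracy: 0.75
for label, (p, r, f1) in per_class.items():
    print(f"{label:>2}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

The rounded values agree with the classification report printed in the experiment summary at the end of the notebook.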


In [16]:
# Save the classification report and confusion matrix for logging
report = classification_report(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)

In [17]:
# Create dictionaries for the data we want to log

params = {"preprocessing":  "_preprocess_V4(df)",
          "keeps username":"True",
          "keeps hashtags":"True",
          "keeps URL":"urlweb",
          "removes punctuation":"string punctuation",
          "use stopwords":"False",
          "model_type": "LinearSVC",
          }

metrics = {"report" : report,
           }

In [18]:
# Log our parameters and results
experiment.log_parameters(params)
experiment.log_metric("report",report)
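
The report above is logged as one block of text, which is readable but hard to chart or compare across experiments. A sketch of one way to log numbers instead: flatten the nested dictionary that `classification_report(..., output_dict=True)` returns and pass it to `experiment.log_metrics`. The helper name `flatten_report` is hypothetical, and the sample dictionary holds rounded values from this run's report:

```python
def flatten_report(report_dict):
    """Turn a nested classification report into {"<label>_<metric>": number}."""
    flat = {}
    for label, scores in report_dict.items():
        if isinstance(scores, dict):
            for metric, value in scores.items():
                flat[f"{label}_{metric}"] = value
        else:
            # "accuracy" is stored as a single number, not a dict
            flat[label] = scores
    return flat

# Two entries in the shape produced by
# classification_report(y_test, y_pred, output_dict=True)
sample = {
    "-1": {"precision": 0.68, "recall": 0.50, "f1-score": 0.58, "support": 126},
    "accuracy": 0.75,
}
numeric_metrics = flatten_report(sample)
print(numeric_metrics["-1_f1-score"])  # 0.58

# In the notebook (before experiment.end()):
# experiment.log_metrics(flatten_report(
#     classification_report(y_test, y_pred, output_dict=True)))
```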

If you're using Comet within a Jupyter notebook, it's important to end your experiment when you've finished, as illustrated below.

In [19]:
experiment.end()

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/monicafar147/nlp-climate-change/e94f19ab196e422e800b324adba6ec81
COMET INFO:   Metrics:
COMET INFO:     report :               precision    recall  f1-score   support

          -1       0.68      0.50      0.58       126
           0       0.56      0.36      0.44       224
           1       0.78      0.87      0.83       895
           2       0.75      0.77      0.76       337

    accuracy                           0.75      1582
   macro avg       0.69      0.63      0.65      1582
weighted avg       0.74      0.75      0.74      1582

COMET INFO:   Parameters:
COMET INFO:     classify_C                 : 1.0
COMET INFO:     classify_class_weight      : 1
COMET INFO:     classify_dual              : True
COMET INFO:     classify_fit_intercep

## Display  

Running `experiment.display()` will show your experiment's Comet.ml page inside your notebook, as illustrated below. You can do this immediately after an experiment is run and logged.

In [20]:
experiment.display()