<a id='Q0'></a>
<center> <h1> POSTme!: Tool for Social Media Messaging Optimization </h1> </center>
<center> <h2> Notebook : Part of Speech Tagging </h2> </center>
<p style="margin-bottom:1cm;"></p>
<p style="margin-bottom:1cm;"></p>

<div style="background:#EEEDF5;border-top:0.1cm solid #EF475B;border-bottom:0.1cm solid #EF475B;">
    <div style="margin-left: 0.5cm;margin-top: 0.5cm;margin-bottom: 0.5cm;color:#303030">
        <p><strong>Goal:</strong> Parts of speech (POS) are lexical categories to which words are assigned based on their syntactic context and role. Most words fall into one of a few major categories, such as nouns, verbs, adjectives, and adverbs. The process of classifying words and labeling them with POS tags is called part-of-speech tagging, or POS tagging. Knowledge of the structure and syntax of a language is helpful in many areas, such as text processing, annotation, and parsing, and supports downstream tasks such as text classification or summarization. The aim of this notebook is to count how many words in each post fall into selected POS groups.</p>
    </div>
</div>

[Source](https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72)

### Import Dependencies


In [None]:
!pip install contractions
!pip install textsearch
!pip install tqdm

Collecting contractions
  Downloading contractions-0.1.66-py2.py3-none-any.whl (8.0 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.21-py2.py3-none-any.whl (7.5 kB)
Collecting anyascii
  Downloading anyascii-0.3.0-py3-none-any.whl (284 kB)

In [None]:
import nltk
import contractions
import re
import tqdm

import pandas as pd
import numpy as np

import textblob

from google.colab import drive #to import and export data directly

In [None]:
nltk.download("punkt")
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

# POS tagging

For this task, we only count how often specific parts of speech occur and group the tags into broader categories. For example, singular nouns (NN), plural nouns (NNS), singular proper nouns (NNP), and plural proper nouns (NNPS) are all counted as nouns.

## List of Part-of-Speech Tags

| POS Tag | Description | Example |
|---------|---------------------------------------|-----------------------------------------|
| CC | coordinating conjunction | and |
| CD | cardinal number | 1, three |
| DT | determiner | the |
| EX | existential there | there is |
| FW | foreign word | d'oeuvre |
| IN | preposition/subordinating conjunction | in, of, like |
| JJ | adjective | big |
| JJR | adjective, comparative | bigger |
| JJS | adjective, superlative | biggest |
| LS | list marker | 1) |
| MD | modal | could, will |
| NN | noun, singular or mass | door |
| NNS | noun plural | doors |
| NNP | proper noun, singular | John |
| NNPS | proper noun, plural | Vikings |
| PDT | predeterminer | both the boys |
| POS | possessive ending | friend's |
| PRP | personal pronoun | I, he, it |
| PRP\$ | possessive pronoun | my, his |
| RB | adverb | however, usually, naturally, here, good |
| RBR | adverb, comparative | better |
| RBS | adverb, superlative | best |
| RP | particle | give up |
| TO | to | to go, to him |
| UH | interjection | uhhuhhuhh |
| VB | verb, base form | take |
| VBD | verb, past tense | took |
| VBG | verb, gerund/present participle | taking |
| VBN | verb, past participle | taken |
| VBP | verb, non-3rd person sing. present | take |
| VBZ | verb, 3rd person sing. present | takes |
| WDT | wh-determiner | which |
| WP | wh-pronoun | who, what |
| WP\$ | possessive wh-pronoun | whose |
| WRB | wh-adverb | where, when |

[Source](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

In [None]:
pos_family = {
    'noun' : ['NN','NNS','NNP','NNPS'],
    'pron' : ['PRP','PRP$','WP','WP$'],
    'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'adj' :  ['JJ','JJR','JJS'],
    'adv' :  ['RB','RBR','RBS','WRB']
}
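
When a single tag needs to be mapped back to its group, the dictionary above can be inverted. The sketch below is one possible way to do this; `tag_to_family` is a hypothetical name that does not appear elsewhere in this notebook.

```python
# Same grouping as the pos_family dictionary defined above.
pos_family = {
    'noun': ['NN', 'NNS', 'NNP', 'NNPS'],
    'pron': ['PRP', 'PRP$', 'WP', 'WP$'],
    'verb': ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'],
    'adj': ['JJ', 'JJR', 'JJS'],
    'adv': ['RB', 'RBR', 'RBS', 'WRB'],
}

# Hypothetical helper: invert the mapping so each tag points to its family.
tag_to_family = {
    tag: family
    for family, tags in pos_family.items()
    for tag in tags
}

print(tag_to_family['NNS'])     # noun
print(tag_to_family.get('DT'))  # None, determiners belong to no group here
```

Tags outside the five groups (for example `DT` or `IN`) are simply not counted in this notebook.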

To label each word in a text (corpus) with its part of speech, we use the TextBlob library. TextBlob is an open-source library for processing textual data and performing common NLP tasks. See the documentation: [TextBlob](https://textblob.readthedocs.io/en/dev/)



In [None]:
# Count the words in a text whose POS tag belongs to the given family
# (a key of pos_family, e.g. 'noun').
# Note: tagging may take some time on larger corpora.

def check_pos_tag(x, flag):
    family_tags = pos_family[flag]
    try:
        blob = textblob.TextBlob(x)
    except TypeError:
        # TextBlob only accepts strings; treat missing or non-text values as zero
        return 0
    # blob.tags is a list of (word, tag) tuples
    return sum(1 for _, tag in blob.tags if tag in family_tags)
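
Since `check_pos_tag` tags the text each time it is called, computing five counts means tagging every post five times. One possible alternative, sketched below, tallies all families in a single pass over tags that have already been computed. The helper name `count_pos_families` is hypothetical, and the sample pairs are written by hand for illustration; they are not actual tagger output.

```python
# Same grouping as the pos_family dictionary defined earlier.
pos_family = {
    'noun': ['NN', 'NNS', 'NNP', 'NNPS'],
    'pron': ['PRP', 'PRP$', 'WP', 'WP$'],
    'verb': ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'],
    'adj': ['JJ', 'JJR', 'JJS'],
    'adv': ['RB', 'RBR', 'RBS', 'WRB'],
}


def count_pos_families(tagged_words):
    """Tally (word, tag) pairs by POS family in one pass (hypothetical helper)."""
    counts = dict.fromkeys(pos_family, 0)
    for _, tag in tagged_words:
        for family, tags in pos_family.items():
            if tag in tags:
                counts[family] += 1
                break  # the groups are disjoint, so one match is enough
    return counts


# Hand-written (word, tag) pairs, shaped like the list TextBlob's .tags returns.
sample = [
    ('We', 'PRP'), ('really', 'RB'), ('love', 'VBP'),
    ('the', 'DT'), ('new', 'JJ'), ('posts', 'NNS'),
]
print(count_pos_families(sample))
# {'noun': 1, 'pron': 1, 'verb': 1, 'adj': 1, 'adv': 1}
```

In this notebook, the pairs would come from `textblob.TextBlob(text).tags`, so each post would be tagged once rather than once per family.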

### Twitter Data

In [None]:
df_twitter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 268 entries, 0 to 267
Data columns (total 23 columns):
 #   Column                                       Non-Null Count  Dtype  
---  ------                                       --------------  -----  
 0   Unnamed: 0                                   268 non-null    int64  
 1   data_id                                      268 non-null    int64  
 2   data_text                                    268 non-null    object 
 3   data_non_public_metrics_impression_count     268 non-null    int64  
 4   data_non_public_metrics_url_link_clicks      210 non-null    float64
 5   data_non_public_metrics_user_profile_clicks  268 non-null    int64  
 6   data_organic_metrics_impression_count        268 non-null    int64  
 7   data_organic_metrics_like_count              268 non-null    int64  
 8   data_organic_metrics_reply_count             268 non-null    int64  
 9   data_organic_metrics_retweet_count           268 non-null    int64  
 10  da

In [None]:
feature_col = 'data_text'

In [None]:
df_twitter['noun_count'] = df_twitter[feature_col].apply(lambda x: check_pos_tag(x, 'noun'))

In [None]:
df_twitter['verb_count'] = df_twitter[feature_col].apply(lambda x: check_pos_tag(x, 'verb'))

In [None]:
df_twitter['adj_count'] = df_twitter[feature_col].apply(lambda x: check_pos_tag(x, 'adj'))

In [None]:
df_twitter['adv_count'] = df_twitter[feature_col].apply(lambda x: check_pos_tag(x, 'adv'))

In [None]:
df_twitter['pron_count'] = df_twitter[feature_col].apply(lambda x: check_pos_tag(x, 'pron'))

In [None]:
twitter_pos = df_twitter[['data_id','noun_count', 'verb_count','adj_count', 'adv_count', 'pron_count']]

In [None]:
twitter_pos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 268 entries, 0 to 267
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   data_id     268 non-null    int64
 1   noun_count  268 non-null    int64
 2   verb_count  268 non-null    int64
 3   adj_count   268 non-null    int64
 4   adv_count   268 non-null    int64
 5   pron_count  268 non-null    int64
dtypes: int64(6)
memory usage: 12.7 KB


### LinkedIn Data

In [None]:
df_linkedin.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 273 entries, 0 to 272
Data columns (total 21 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Unnamed: 0                   273 non-null    int64  
 1   DateTime                     273 non-null    object 
 2   LinkeInPage                  273 non-null    object 
 3   DataID                       273 non-null    int64  
 4   Tags                         97 non-null     object 
 5   PostCampaign                 0 non-null      float64
 6   PostType                     273 non-null    object 
 7   PostMessage                  273 non-null    object 
 8   Reactions                    273 non-null    int64  
 9   Impressions                  273 non-null    int64  
 10  Engagement                   273 non-null    float64
 11  Clicks                       273 non-null    int64  
 12  Shares                       273 non-null    int64  
 13  Comments            

In [None]:
feature_col = "PostMessage"

In [None]:
df_linkedin['noun_count'] = df_linkedin[feature_col].apply(lambda x: check_pos_tag(x, 'noun'))

In [None]:
df_linkedin['verb_count'] = df_linkedin[feature_col].apply(lambda x: check_pos_tag(x, 'verb'))

In [None]:
df_linkedin['adj_count'] = df_linkedin[feature_col].apply(lambda x: check_pos_tag(x, 'adj'))

In [None]:
df_linkedin['adv_count'] = df_linkedin[feature_col].apply(lambda x: check_pos_tag(x, 'adv'))

In [None]:
df_linkedin['pron_count'] = df_linkedin[feature_col].apply(lambda x: check_pos_tag(x, 'pron'))

In [None]:
linkedin_pos = df_linkedin[['DataID','noun_count', 'verb_count','adj_count', 'adv_count', 'pron_count']]

In [None]:
linkedin_pos = linkedin_pos.rename(columns={"DataID":"data_id"})

In [None]:
linkedin_pos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 273 entries, 0 to 272
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   data_id     273 non-null    int64
 1   noun_count  273 non-null    int64
 2   verb_count  273 non-null    int64
 3   adj_count   273 non-null    int64
 4   adv_count   273 non-null    int64
 5   pron_count  273 non-null    int64
dtypes: int64(6)
memory usage: 12.9 KB
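
Because `linkedin_pos` now shares its column names with `twitter_pos`, the two tables can be stacked if a combined view is needed. The sketch below shows one way to do that with `pd.concat`. It uses small stand-in frames with made-up values, and the `platform` column and the `all_pos` name are assumptions for illustration, not part of the original data.

```python
import pandas as pd

cols = ['data_id', 'noun_count', 'verb_count', 'adj_count', 'adv_count', 'pron_count']

# Stand-in frames with made-up values, shaped like twitter_pos and linkedin_pos.
twitter_pos = pd.DataFrame([[1, 5, 2, 1, 0, 1], [2, 3, 1, 0, 1, 0]], columns=cols)
linkedin_pos = pd.DataFrame([[10, 8, 4, 2, 1, 2]], columns=cols)

# 'platform' is a hypothetical column that records where each row came from.
all_pos = pd.concat(
    [
        twitter_pos.assign(platform='twitter'),
        linkedin_pos.assign(platform='linkedin'),
    ],
    ignore_index=True,
)

print(all_pos.shape)  # (3, 7)
```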
