
**COMING UP WITH A SHORT LIST OF POSITIVE ADJECTIVES** 

This is a preparatory stage of my SQLite analysis of the Yelp dataset. I wanted to run a sentiment analysis of user-written reviews to see whether users who write more positive reviews have more followers. Because SQLite does not support list-like data types, I needed a short list of common, strongly positive words (in this case, adjectives only). The code below uses Python libraries (pandas, nltk) and a SentiWordNet dataset to find the most common strongly positive adjectives. "Strongly positive" means a SentiWordNet positive score of at least 0.8. "Common" is operationally defined by frequency of use in the reviews category of the Brown corpus: an adjective counts as common if it appears there more than five times.

In [135]:
#importing the necessary libraries and downloading the Brown corpus for use in Google Colab
import pandas as pd
import nltk
from nltk import FreqDist
nltk.download('brown')
from nltk.corpus import brown

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


In [139]:
#reading SentiWordNet dataset as Pandas dataframe
df = pd.read_csv("https://raw.githubusercontent.com/TharinduMunasinge/Twitter-Sentiment-Analysis/master/sentiwordnet.csv", sep = "\t")
cols =  ["POS", "pos_score", "neg_score", "term"]
df.columns = cols
df.head()

Unnamed: 0,POS,pos_score,neg_score,term
0,a,0.0,0.75,unable#1
1,a,0.0,0.0,dorsal#2 abaxial#1
2,a,0.0,0.0,ventral#2 adaxial#1
3,a,0.0,0.0,acroscopic#1
4,a,0.0,0.0,basiscopic#1


In [0]:
#helper for building a dataframe of strongly positive (or strongly negative) adjectives
def adj(col):
  """Return a dataframe of adjectives scoring at least 0.8 in `col`.

  `col` is a score column of the global `df`: df.pos_score or df.neg_score.
  """
  mask = (col >= 0.8) & (df.POS == "a")
  high_df = pd.DataFrame({"word": df.term[mask], "score": col[mask]})
  #strip everything from the first "#" on, so only the first synonym of each row is kept
  #regex=True is needed in pandas 2.0+, where str.replace treats the pattern literally by default
  high_df["word"] = high_df.word.str.replace("#.+", "", regex=True)
  return high_df

In [140]:
pos_df = adj(df.pos_score)
pos_df

Unnamed: 0,word,score
108,veracious,0.875
535,selfless,0.875
632,perked_up,0.875
898,attractive,0.875
903,piquant,0.875
...,...,...
12984,topping,1.000
13245,esthetic,0.875
13936,virtuous,0.875
14192,salubrious,0.875
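
One side effect of the `"#.+"` pattern in `adj` is worth noting: it deletes everything from the first `#` to the end of the field, so a multi-synonym row such as `dorsal#2 abaxial#1` (visible in the `df.head()` output) is reduced to `dorsal` alone. If every synonym should be kept, a minimal sketch could look like the following. The helper name `split_terms` is hypothetical and not part of the original notebook, and it assumes each synonym has the form `word#N` with synonyms separated by spaces.

```python
import re

def split_terms(term_field):
    """Return every synonym in a SentiWordNet term field, without sense numbers."""
    # Split on whitespace, then drop the trailing "#N" sense marker from each token.
    return [re.sub(r"#\d+$", "", token) for token in term_field.split()]

print(split_terms("unable#1"))            # ['unable']
print(split_terms("dorsal#2 abaxial#1"))  # ['dorsal', 'abaxial']
```

Applied to the dataframe, this could be combined with `Series.map` and `DataFrame.explode` to get one row per synonym. Whether that changes the final short list depends on the data and has not been checked here.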


In [0]:
#determining the most common adjectives in the Brown corpus
def common_adjs():
  """Return adjectives (Brown tag 'JJ') that occur more than 5 times in the 'reviews' category."""
  tagged = brown.tagged_words(categories='reviews')
  adjs = [word.lower() for word, pos in tagged if pos == 'JJ']
  fdist = FreqDist(adjs)
  return [word for word, count in fdist.items() if count > 5]
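
To make the frequency threshold in `common_adjs` concrete without downloading the corpus, here is a small sketch on toy data. It uses `collections.Counter` from the standard library (NLTK's `FreqDist` is a subclass of it), and the `tagged` list and the `common_adjectives` helper are made up for illustration. Note that matching the tag exactly against `'JJ'` leaves out other adjective tags in the Brown tagset, such as the comparative `'JJR'`.

```python
from collections import Counter

# Toy (word, tag) pairs standing in for brown.tagged_words(categories='reviews').
tagged = [
    ("Good", "JJ"), ("good", "JJ"), ("film", "NN"), ("nice", "JJ"),
    ("good", "JJ"), ("nice", "JJ"), ("odd", "JJ"), ("better", "JJR"),
]

def common_adjectives(tagged_words, min_count):
    """Return lowercased 'JJ' words that occur strictly more than min_count times."""
    counts = Counter(word.lower() for word, tag in tagged_words if tag == "JJ")
    return sorted(word for word, n in counts.items() if n > min_count)

print(common_adjectives(tagged, min_count=1))  # ['good', 'nice']
```

Here "good" occurs three times (after lowercasing), "nice" twice and "odd" once, so only the first two pass a threshold of "more than 1". The notebook applies the same test with a threshold of 5.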

In [0]:
commonads = common_adjs()

In [0]:
#determining strongly positive common adjectives
strong_pos = set(pos_df[pos_df.word.isin(commonads)].word)

In [144]:
strong_pos

{'attractive', 'charming', 'good', 'intellectual', 'nice', 'solid', 'superb'}
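
Since the point of the short list is to work around the lack of list types in SQLite, here is one way it might be used there. This is only a sketch under assumptions: the `review` table, its `user_id` and `text` columns, and the sample rows are hypothetical and do not reflect the actual Yelp schema. Each adjective becomes one `text LIKE ?` test, and the tests are joined with `OR`.

```python
import sqlite3

strong_pos = {"attractive", "charming", "good", "intellectual", "nice", "solid", "superb"}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE review (user_id TEXT, text TEXT)")  # hypothetical schema
conn.executemany(
    "INSERT INTO review VALUES (?, ?)",
    [
        ("u1", "Superb food and a charming patio"),
        ("u2", "Slow service, cold fries"),
        ("u1", "Good value"),
    ],
)

# One "text LIKE ?" test per adjective; the words are bound as parameters.
words = sorted(strong_pos)
where = " OR ".join("text LIKE ?" for _ in words)
params = [f"%{w}%" for w in words]

query = (
    "SELECT user_id, COUNT(*) FROM review "
    f"WHERE {where} GROUP BY user_id ORDER BY user_id"
)
rows = conn.execute(query, params).fetchall()
print(rows)  # [('u1', 2)]
```

`LIKE` in SQLite is case-insensitive for ASCII letters by default, which is why "Superb" and "Good" match. The `%word%` pattern is a plain substring test, so it would also match longer words such as "goodness"; a stricter word-boundary match would need extra handling.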