# STATS

In [16]:
import pandas as pd
from scipy import stats
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk.sentiment

import wrangle

In [2]:
train, val, test = wrangle.wrangle_glassdoor()

In [3]:
train.head()

Unnamed: 0,url,pros,cons,name,rating,pros_cleaned,pros_lemmatized,cons_cleaned,cons_lemmatized,binned_rating,binned_rating_int
490,https://www.glassdoor.com/Reviews/Perficient-R...,Perficient is an ethical company that actually...,"None at all, love, love, love this company!\nI...",Perficient,4.1,perficient is an ethical company that actually...,perficient ethical company actually value empl...,none at all love love love this company\nit is...,none love love love company good company canno...,Four,4
273,https://www.glassdoor.com/Reviews/Farmers-Insu...,"This company is the best ever.\nLarge, establi...",I have nothing bad to say.\nManagement company...,Farmers Insurance Group,3.4,this company is the best ever\nlarge establish...,company best ever large established company so...,i have nothing bad to say\nmanagement company ...,nothing bad say management company get paid re...,Three,3
30,https://www.glassdoor.com/Reviews/MIT-Reviews-...,"Very inspiring place to work at, to feel that ...",Depends on the project to how much organizatio...,MIT,4.4,very inspiring place to work at to feel that s...,inspiring place work feel something new happen...,depends on the project to how much organizatio...,depends project much organization team include...,Four,4
406,https://www.glassdoor.com/Reviews/Morningstar-...,"- Coworkers are amicable, and they're overall ...",- Base pay for the area could be slightly high...,Morningstar,4.1,coworkers are amicable and they ' re overall v...,coworkers amicable ' overall supportive unlimi...,base pay for the area could be slightly higher...,base pay area could slightly higher bonus prog...,Four,4
163,https://www.glassdoor.com/Reviews/ICF-Reviews-...,Loved the job and the people. Great flexibilit...,"None, I would fully recommend\nThere was disho...",ICF,3.8,loved the job and the people great flexibility...,loved job people great flexibility fun project...,none i would fully recommend\nthere was dishon...,none would fully recommend dishonest hidden in...,Three,3


# Significance of words (tf-idf)

Inverse document frequency (IDF) measures how rare a word is across a collection of documents (the corpus). Multiplying it by a word's term frequency (TF) within one document gives the TF-IDF score, which favors words that are frequent in that document but uncommon elsewhere.

- A higher IDF score means the word appears in fewer documents, so it is more distinctive within the collection.
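
As a small illustration, the sketch below computes a textbook IDF, log(N / df), on a made-up three-document corpus. The corpus and the name `toy_idf` are assumptions for illustration only; the `idf` function defined later in this notebook uses a different, smoothed ratio rather than this logarithmic form.

```python
import math

# Hypothetical toy corpus (not taken from the Glassdoor data)
toy_docs = [
    "great pay great people",
    "poor management great pay",
    "long hours poor management",
]

# Tokenize each document once so membership checks match whole words
toy_tokens = [set(doc.split()) for doc in toy_docs]

def toy_idf(word):
    """Textbook IDF: log(N / df), where df counts documents containing the word.

    Assumes the word occurs in at least one document (df > 0).
    """
    df = sum(1 for tokens in toy_tokens if word in tokens)
    return math.log(len(toy_tokens) / df)

print(toy_idf("great"))  # appears in 2 of 3 documents
print(toy_idf("hours"))  # appears in 1 of 3 documents, so it scores higher
```

The rarer word receives the larger score, which is the behavior described in the bullet above.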

In [4]:
# # Initialize a TF-IDF vectorizer and fit it to your documents
# tfidf_vectorizer = TfidfVectorizer()
# tfidf_matrix = tfidf_vectorizer.fit_transform(train['pros_lemmatized'] + ' ' + train['cons_lemmatized'])
# tfidf_matrix

In [5]:
# # Get the TF-IDF scores for each word:
# feature_names = tfidf_vectorizer.get_feature_names_out()
# tfidf_scores = tfidf_matrix.toarray()
# tfidf_scores

In [6]:
# # Create a DataFrame with TF-IDF scores
# tfidf_df = pd.DataFrame(tfidf_scores, columns=feature_names)
# tfidf_df

**Calculate the TF score**

In [7]:
documents = {
    'pros': " ".join(train.pros_lemmatized.values),
    'cons': " ".join(train.cons_lemmatized.values),
}

# Create an empty list to store the TF dataframes
tfs = []

# Iterate through documents and their corresponding text
for doc, text in documents.items():
    # Split the text into words and count their occurrences
    word_counts = pd.Series(text.split()).value_counts()

    # Name the index and the counts explicitly (the default column names differ
    # between pandas versions), then calculate the term frequency (TF)
    tf_df = word_counts.rename_axis('word').reset_index(name='word_count')
    tf_df["tf"] = tf_df.word_count / len(text.split())
    tf_df = tf_df.assign(doc = doc)

    # Append the TF dataframe to the list
    tfs.append(tf_df)
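
To sanity-check the term-frequency step, the sketch below repeats the same calculation on a single made-up sentence. The text and the name `toy_tf` are hypothetical; only the pandas calls mirror the cell above.

```python
import pandas as pd

# Hypothetical toy text (not taken from the Glassdoor data)
toy_text = "great pay great people great benefit"
tokens = toy_text.split()

# Count each word, then divide by the total number of tokens
toy_tf = (
    pd.Series(tokens)
    .value_counts()
    .rename_axis("word")
    .reset_index(name="word_count")
)
toy_tf["tf"] = toy_tf["word_count"] / len(tokens)

print(toy_tf)
```

Within one document the `tf` column should sum to 1, which is a quick way to confirm that the counts and the denominator come from the same tokenization.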

**Calculate IDF score**

In [9]:
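# Tokenize each document once so membership is checked against whole words,
# not substrings (e.g. 'work' should not match 'network')
doc_tokens = {name: set(text.split()) for name, text in documents.items()}

def idf(word):
    """
    Calculate a smoothed inverse document frequency for a word:
    number of documents / (number of documents containing the word + 1).
    """
    n_occurrences = sum(1 for tokens in doc_tokens.values() if word in tokens)
    return len(documents) / (n_occurrences + 1)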

In [13]:
# Calculate the tf-idf score of each word and add it to the TF dataframe
tf_idf_scores = pd.concat(tfs, axis=0).assign(idf=lambda df: df.word.apply(idf)).assign(tf_idf=lambda df: df.idf * df.tf)

In [15]:
tf_idf_scores.head()

Unnamed: 0,word,word_count,tf,doc,idf,tf_idf
0,work,21492,0.036295,pros,0.666667,0.024197
1,great,20580,0.034755,pros,0.666667,0.02317
2,good,20196,0.034107,pros,0.666667,0.022738
3,benefit,11616,0.019617,pros,0.666667,0.013078
4,people,9423,0.015913,pros,0.666667,0.010609


**Add sentiment scores for each word**

In [18]:
# Create a VADER sentiment analyzer (requires the NLTK 'vader_lexicon' resource)
sia = nltk.sentiment.SentimentIntensityAnalyzer()
# Score each word on its own and keep the compound polarity score
tf_idf_scores['sentiment'] = tf_idf_scores.word.apply(lambda word: sia.polarity_scores(word)['compound'])

In [19]:
tf_idf_scores.head()

Unnamed: 0,word,word_count,tf,doc,idf,tf_idf,sentiment
0,work,21492,0.036295,pros,0.666667,0.024197,0.0
1,great,20580,0.034755,pros,0.666667,0.02317,0.6249
2,good,20196,0.034107,pros,0.666667,0.022738,0.4404
3,benefit,11616,0.019617,pros,0.666667,0.013078,0.4588
4,people,9423,0.015913,pros,0.666667,0.010609,0.0


### Test documents together

1. **Does word count affect sentiment?**
   - Null Hypothesis (H0): There is no significant relationship between word count and sentiment.
   - Alternative Hypothesis (H1): There is a significant relationship between word count and sentiment.

2. **Is there a correlation between word frequency (tf) and sentiment?**
   - H0: There is no significant correlation between word frequency (tf) and sentiment.
   - H1: There is a significant correlation between word frequency (tf) and sentiment.

3. **Does the inverse document frequency (idf) of a word impact its sentiment?**
   - H0: There is no significant impact of inverse document frequency (idf) on sentiment.
   - H1: There is a significant impact of inverse document frequency (idf) on sentiment.

4. **Do words with higher tf-idf scores tend to have a specific sentiment?**
   - H0: There is no significant relationship between tf-idf scores and sentiment.
   - H1: There is a significant relationship between tf-idf scores and sentiment.

5. **Is there a significant difference in sentiment between different documents (doc) or groups of documents?**
   - H0: There is no significant difference in sentiment between documents or groups of documents.
   - H1: There is a significant difference in sentiment between documents or groups of documents.

6. **Do specific words have significantly different sentiment scores compared to the overall sentiment of the documents they appear in?**
   - H0: The sentiment of specific words is not significantly different from the overall sentiment of the documents they appear in.
   - H1: The sentiment of specific words is significantly different from the overall sentiment of the documents they appear in.

7. **Is there a significant difference in sentiment scores across different word categories or topics?**
   - H0: There is no significant difference in sentiment scores across word categories or topics.
   - H1: There is a significant difference in sentiment scores across word categories or topics.

8. **Does the sentiment of a document correlate with its length (word count)?**
   - H0: There is no significant correlation between document length and sentiment.
   - H1: There is a significant correlation between document length and sentiment.

9. **Is there a significant difference in sentiment scores across different documents (doc)?**
   - H0: There is no significant difference in sentiment scores across different documents.
   - H1: There is a significant difference in sentiment scores across different documents.

10. **Does sentiment vary significantly between documents with different levels of word frequency (tf)?**
    - H0: There is no significant difference in sentiment between documents with different levels of word frequency (tf).
    - H1: There is a significant difference in sentiment between documents with different levels of word frequency (tf).
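
One possible way to test a correlation hypothesis such as question 2 is a Spearman rank correlation from `scipy.stats`. The sketch below is only an illustration: the dataframe `toy_scores` holds made-up values standing in for `tf_idf_scores`, and the significance level `alpha = 0.05` is an assumption, not a choice stated in this notebook.

```python
import pandas as pd
from scipy import stats

# Hypothetical stand-in for tf_idf_scores (made-up values)
toy_scores = pd.DataFrame({
    "tf": [0.036, 0.034, 0.019, 0.015, 0.010, 0.008],
    "sentiment": [0.0, 0.62, 0.46, 0.0, -0.34, 0.44],
})

alpha = 0.05  # assumed significance level

# Spearman rank correlation does not assume normally distributed values
rho, p_value = stats.spearmanr(toy_scores["tf"], toy_scores["sentiment"])

if p_value < alpha:
    print(f"Reject H0: rho={rho:.3f}, p={p_value:.3f}")
else:
    print(f"Fail to reject H0: rho={rho:.3f}, p={p_value:.3f}")
```

Spearman is used here because word frequencies are typically heavily skewed; `stats.pearsonr` would be the alternative if a linear relationship between roughly normal variables were expected.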

### Test documents separately

1. **Is there a significant difference in sentiment between the two documents?**
   - H0: There is no significant difference in sentiment between the two documents.
   - H1: There is a significant difference in sentiment between the two documents.

2. **Do specific words have significantly different sentiment scores between the two documents?**
   - H0: The sentiment of specific words is not significantly different between the two documents.
   - H1: The sentiment of specific words is significantly different between the two documents.

3. **Is there a significant correlation between word frequency (tf) and sentiment within each document?**
   - For Document 1:
     - H0: There is no significant correlation between word frequency (tf) and sentiment within Document 1.
     - H1: There is a significant correlation between word frequency (tf) and sentiment within Document 1.

   - For Document 2:
     - H0: There is no significant correlation between word frequency (tf) and sentiment within Document 2.
     - H1: There is a significant correlation between word frequency (tf) and sentiment within Document 2.

4. **Is there a significant difference in sentiment scores between words in Document 1 and words in Document 2?**
   - H0: There is no significant difference in sentiment scores between words in Document 1 and words in Document 2.
   - H1: There is a significant difference in sentiment scores between words in Document 1 and words in Document 2.

5. **Does the sentiment of each document correlate with its respective word count?**
   - For Document 1:
     - H0: There is no significant correlation between the word count of Document 1 and its sentiment.
     - H1: There is a significant correlation between the word count of Document 1 and its sentiment.

   - For Document 2:
     - H0: There is no significant correlation between the word count of Document 2 and its sentiment.
     - H1: There is a significant correlation between the word count of Document 2 and its sentiment.

6. **Is there a significant difference in sentiment scores between the two documents based on their word count?**
   - H0: There is no significant difference in sentiment scores between the two documents based on their word count.
   - H1: There is a significant difference in sentiment scores between the two documents based on their word count.

7. **Do specific words have significantly different sentiment scores between Document 1 and Document 2 based on their tf-idf scores within each document?**
   - H0: The sentiment of specific words is not significantly different between Document 1 and Document 2 based on their tf-idf scores.
   - H1: The sentiment of specific words is significantly different between Document 1 and Document 2 based on their tf-idf scores.
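
A difference-between-documents hypothesis such as question 1 could be checked with a two-sample test. The sketch below uses a Mann-Whitney U test on made-up sentiment values; `toy_scores`, its values, and `alpha = 0.05` are assumptions for illustration, with `doc` playing the same role as the `doc` column in `tf_idf_scores`.

```python
import pandas as pd
from scipy import stats

# Hypothetical stand-in for tf_idf_scores (made-up values)
toy_scores = pd.DataFrame({
    "doc": ["pros"] * 5 + ["cons"] * 5,
    "sentiment": [0.62, 0.44, 0.46, 0.0, 0.36, -0.54, -0.34, 0.0, -0.48, 0.1],
})

alpha = 0.05  # assumed significance level

# Split the word-level sentiment scores by document
pros_sentiment = toy_scores.loc[toy_scores["doc"] == "pros", "sentiment"]
cons_sentiment = toy_scores.loc[toy_scores["doc"] == "cons", "sentiment"]

# Two-sided Mann-Whitney U test: no normality assumption on the scores
u_stat, p_value = stats.mannwhitneyu(
    pros_sentiment, cons_sentiment, alternative="two-sided"
)

if p_value < alpha:
    print(f"Reject H0: U={u_stat:.1f}, p={p_value:.3f}")
else:
    print(f"Fail to reject H0: U={u_stat:.1f}, p={p_value:.3f}")
```

On the real data, a word that occurs in both documents receives the same word-level sentiment score in each, so the two samples are not fully independent; that limits how far a result from this kind of test can be interpreted.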