# Pandas showcase for simple Data Analysis (for beginners)

The goal of this blog post is to get you interested in Python and Pandas by showing that you can clean and analyze data without opening Excel, and do it much faster.

Data Source:
- https://www.kaggle.com/datasets/osmi/mental-health-in-tech-survey

In [109]:
import string
from collections import Counter

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize


## 1. Text Preparation

In [64]:
df = pd.read_csv('data/survey.csv')

In [67]:
df.head(20)

Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,...,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,...,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,...,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,
3,2014-08-27 11:29:46,31,Male,United Kingdom,,,Yes,Yes,Often,26-100,...,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,
4,2014-08-27 11:30:22,31,Male,United States,TX,,No,No,Never,100-500,...,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,
5,2014-08-27 11:31:22,33,Male,United States,TN,,Yes,No,Sometimes,6-25,...,Don't know,No,No,Yes,Yes,No,Maybe,Don't know,No,
6,2014-08-27 11:31:50,35,Female,United States,MI,,Yes,Yes,Sometimes,1-5,...,Somewhat difficult,Maybe,Maybe,Some of them,No,No,No,Don't know,No,
7,2014-08-27 11:32:05,39,M,Canada,,,No,No,Never,1-5,...,Don't know,No,No,No,No,No,No,No,No,
8,2014-08-27 11:32:39,42,Female,United States,IL,,Yes,Yes,Sometimes,100-500,...,Very difficult,Maybe,No,Yes,Yes,No,Maybe,No,No,
9,2014-08-27 11:32:43,23,Male,Canada,,,No,No,Never,26-100,...,Don't know,No,No,Yes,Yes,Maybe,Maybe,Yes,No,


In [73]:
df['clean_comments'] = df['comments'].fillna('')
df.head(20)

Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments,clean_comments
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,...,No,No,Some of them,Yes,No,Maybe,Yes,No,,
1,2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,...,Maybe,No,No,No,No,No,Don't know,No,,
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,...,No,No,Yes,Yes,Yes,Yes,No,No,,
3,2014-08-27 11:29:46,31,Male,United Kingdom,,,Yes,Yes,Often,26-100,...,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,,
4,2014-08-27 11:30:22,31,Male,United States,TX,,No,No,Never,100-500,...,No,No,Some of them,Yes,Yes,Yes,Don't know,No,,
5,2014-08-27 11:31:22,33,Male,United States,TN,,Yes,No,Sometimes,6-25,...,No,No,Yes,Yes,No,Maybe,Don't know,No,,
6,2014-08-27 11:31:50,35,Female,United States,MI,,Yes,Yes,Sometimes,1-5,...,Maybe,Maybe,Some of them,No,No,No,Don't know,No,,
7,2014-08-27 11:32:05,39,M,Canada,,,No,No,Never,1-5,...,No,No,No,No,No,No,No,No,,
8,2014-08-27 11:32:39,42,Female,United States,IL,,Yes,Yes,Sometimes,100-500,...,Maybe,No,Yes,Yes,No,Maybe,No,No,,
9,2014-08-27 11:32:43,23,Male,Canada,,,No,No,Never,26-100,...,No,No,Yes,Yes,Maybe,Maybe,Yes,No,,
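
To see what `fillna('')` does in isolation, here is a minimal sketch on a made-up three-row frame (the sample values are invented for illustration and are not rows from the survey). Missing comments are stored as `NaN` rather than as text, so replacing them with an empty string gives the later text-processing steps a string to work on in every row.

```python
import pandas as pd

# Hypothetical sample data, not rows from the survey.
sample = pd.DataFrame({"comments": ["Great benefits", None, "No support at all"]})

# Missing values are swapped for empty strings; real comments are untouched.
sample["clean_comments"] = sample["comments"].fillna("")

print(sample["comments"].isna().sum())    # how many comments were missing
print(sample["clean_comments"].tolist())
```

Keeping the original `comments` column next to the cleaned one makes it easy to tell a truly missing comment from an empty one later on.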


In [59]:
# Note: recent NLTK releases may also need 'punkt_tab' and 'averaged_perceptron_tagger_eng'
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('vader_lexicon')

[nltk_data] Downloading package punkt to /home/lukasz/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/lukasz/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/lukasz/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/lukasz/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/lukasz/nltk_data...


True

In [45]:
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

In [52]:
def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank POS tag to the matching WordNet POS constant."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # by default, treat as noun

In [53]:
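def preprocess_text(text):
    """Tokenize, lowercase, lemmatize, and remove stop words from a comment."""
    tokens = word_tokenize(text)
    tokens = [word.lower() for word in tokens]
    # Keep alphabetic words and punctuation tokens; drop numbers and mixed tokens
    tokens = [word for word in tokens if word.isalpha() or word in string.punctuation]
    # Tag parts of speech so the lemmatizer can pick the right base form
    pos_tags = nltk.pos_tag(tokens)
    tokens = [lemmatizer.lemmatize(word, get_wordnet_pos(pos)) for word, pos in pos_tags]
    tokens = [word for word in tokens if word not in stop_words]

    return " ".join(tokens)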


In [74]:
df['processed_comments'] = df['clean_comments'].apply(preprocess_text)

In [76]:
df.head(20)

Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,...,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments,clean_comments,processed_comments
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,...,No,Some of them,Yes,No,Maybe,Yes,No,,,
1,2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,...,No,No,No,No,No,Don't know,No,,,
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,...,No,Yes,Yes,Yes,Yes,No,No,,,
3,2014-08-27 11:29:46,31,Male,United Kingdom,,,Yes,Yes,Often,26-100,...,Yes,Some of them,No,Maybe,Maybe,No,Yes,,,
4,2014-08-27 11:30:22,31,Male,United States,TX,,No,No,Never,100-500,...,No,Some of them,Yes,Yes,Yes,Don't know,No,,,
5,2014-08-27 11:31:22,33,Male,United States,TN,,Yes,No,Sometimes,6-25,...,No,Yes,Yes,No,Maybe,Don't know,No,,,
6,2014-08-27 11:31:50,35,Female,United States,MI,,Yes,Yes,Sometimes,1-5,...,Maybe,Some of them,No,No,No,Don't know,No,,,
7,2014-08-27 11:32:05,39,M,Canada,,,No,No,Never,1-5,...,No,No,No,No,No,No,No,,,
8,2014-08-27 11:32:39,42,Female,United States,IL,,Yes,Yes,Sometimes,100-500,...,No,Yes,Yes,No,Maybe,No,No,,,
9,2014-08-27 11:32:43,23,Male,Canada,,,No,No,Never,26-100,...,No,Yes,Yes,Maybe,Maybe,Yes,No,,,


## 2. Data Analysis

### 2.1. Sentiment

In [60]:
sia = SentimentIntensityAnalyzer()

In [61]:
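def get_sentiment(text):
    """Label text as positive, negative, or neutral from VADER's compound score."""
    # The compound score ranges from -1 (most negative) to +1 (most positive)
    sentiment_score = sia.polarity_scores(text)['compound']
    if sentiment_score >= 0.05:
        return "positive"
    elif sentiment_score <= -0.05:
        return "negative"
    else:
        return "neutral"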
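
The thresholding step can be looked at separately from the scoring step. The sketch below assumes a hypothetical helper, `label_from_compound`, that takes an already computed compound score, so it runs without downloading the VADER lexicon. It is only meant to show how the ±0.05 cut-offs split the score range.

```python
def label_from_compound(score, threshold=0.05):
    """Map a compound score in [-1, 1] to a label (hypothetical helper)."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

# A score of exactly 0.0 falls inside the neutral band.
for score in (0.6, 0.0, -0.3):
    print(score, label_from_compound(score))
```

An empty comment has nothing to score, so it ends up in the neutral band. When counting labels it may be worth excluding rows where `clean_comments` is empty, otherwise the neutral group is mostly made of missing comments.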

In [77]:
df['sentiment_comments'] = df['processed_comments'].apply(get_sentiment)
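
One thing worth checking is what stop-word removal does to negations before the text reaches the sentiment analyzer. NLTK's English stop-word list includes words such as "not" and "no", and VADER uses negations when scoring. The sketch below uses a tiny hand-picked stop-word set and a hypothetical helper, `drop_stop_words`, purely for illustration. It is not the pipeline above, only a demonstration of the effect.

```python
# Tiny hand-picked subset for illustration; NLTK's English list is much
# longer, and it also contains "not" and "no".
stop_words = {"i", "am", "with", "my", "not", "no"}

def drop_stop_words(text):
    """Lowercase, split on whitespace, and remove stop words."""
    return " ".join(w for w in text.lower().split() if w not in stop_words)

print(drop_stop_words("I am not happy with my employer"))  # happy employer
```

Since the negation disappears, a comment like this could be scored as positive. Comparing the labels from `processed_comments` with labels computed on `clean_comments` is a quick way to see how much this matters for the survey data.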

In [80]:
cols = ['comments', 'clean_comments', 'processed_comments', 'sentiment_comments']
df[cols].head(20)

Unnamed: 0,comments,clean_comments,processed_comments,sentiment_comments
0,,,,neutral
1,,,,neutral
2,,,,neutral
3,,,,neutral
4,,,,neutral
5,,,,neutral
6,,,,neutral
7,,,,neutral
8,,,,neutral
9,,,,neutral


In [98]:
df_sentiment = df[cols]

negatives_df = df_sentiment[df_sentiment['sentiment_comments'] == 'negative'][['comments', 'processed_comments']]
negatives = negatives_df['comments'].tolist()

positives_df = df_sentiment[df_sentiment['sentiment_comments'] == 'positive'][['comments', 'processed_comments']]
positives = positives_df['comments'].tolist()
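
The filtering above relies on boolean masks. Here is a minimal sketch of the same idea on a made-up frame (the rows are invented for illustration), using `.loc` to select the rows and the column in one step.

```python
import pandas as pd

# Hypothetical labelled comments, invented for illustration.
sample = pd.DataFrame({
    "comments": ["Great benefits", "", "No support at all"],
    "sentiment_comments": ["positive", "neutral", "negative"],
})

# The comparison yields a boolean Series; indexing with it keeps the True rows.
is_negative = sample["sentiment_comments"] == "negative"
negative_comments = sample.loc[is_negative, "comments"].tolist()

print(is_negative.tolist())   # [False, False, True]
print(negative_comments)      # ['No support at all']
```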


In [99]:
# Slicing avoids an IndexError if there are fewer than three comments
for comment in negatives[:3]:
    print(f"{comment}\n")

Our health plan has covered my psychotherapy and my antidepressant medication. My manager has been aware but discreet throughout. I did get negative reviews when my depression was trashing my delivery but y'know I wasn't delivering.

In addition to my own mental health issues I've known several coworkers that may be suffering and I don't know how to tell them I empathize and that I want to help.

In my previous workplace which had mental health protections policies and access to counsellors my Director went so far as to say to me in somewhat casual conversation A woman was murdered across the street. At best though she was bipolar and at worst - who knowsI have bipolar disorder. I have zero faith that an organization with policies in place could appropriately handle mental health. I have even less faith that a workplace without the policies in place could appropriately handle mental health. I can only imagine it's worse in full tech environments.


In [100]:
# Slicing avoids an IndexError if there are fewer than three comments
for comment in positives[:3]:
    print(f"{comment}\n")

I have chronic low-level neurological issues that have mental health side effects. One of my supervisors has also experienced similar neurological problems so I feel more comfortable being open about my issues than I would with someone without that experience. 

Sometimes I think  about using drugs for my mental health issues. If i use drugs I feel better

I selected my current employer based on its policies about self care and the quality of their overall health and wellness benefits. I still have residual caution from previous employers who ranged from ambivalent to indifferent to actively hostile regarding mental health concerns.


### 2.2. Frequency

In [106]:
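def most_common(df, top_n=10):
    """Return the top_n most frequent words in processed_comments as (word, count) pairs."""
    # Join all comments into one string, split into words, and skip punctuation tokens
    all_words = [word for word in ' '.join(df['processed_comments']).split() if word not in string.punctuation]
    word_counts = Counter(all_words)
    top_words = word_counts.most_common(top_n)

    return top_words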

In [107]:
print(most_common(negatives_df))

[('health', 45), ('mental', 41), ('issue', 22), ('employer', 21), ('depression', 20), ('work', 19), ('would', 15), ('bad', 12), ('people', 12), ('insurance', 12)]


In [108]:
print(most_common(positives_df))

[('health', 63), ('mental', 55), ('work', 36), ('issue', 30), ('employer', 26), ('company', 25), ('would', 21), ('know', 17), ('people', 17), ('get', 15)]
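
Both lists are dominated by the same words ("health", "mental", "issue"), which is expected given the survey topic. One possible next step, sketched below as an idea rather than a definitive method, is to drop the words that appear in both top-10 lists and look at what is left. The two lists are copied from the outputs above.

```python
# Top-10 lists copied from the two outputs above.
negative_top = [('health', 45), ('mental', 41), ('issue', 22), ('employer', 21),
                ('depression', 20), ('work', 19), ('would', 15), ('bad', 12),
                ('people', 12), ('insurance', 12)]
positive_top = [('health', 63), ('mental', 55), ('work', 36), ('issue', 30),
                ('employer', 26), ('company', 25), ('would', 21), ('know', 17),
                ('people', 17), ('get', 15)]

negative_words = {word for word, _ in negative_top}
positive_words = {word for word, _ in positive_top}

# Keep the original ranking order while dropping words common to both lists.
only_negative = [word for word, _ in negative_top if word not in positive_words]
only_positive = [word for word, _ in positive_top if word not in negative_words]

print(only_negative)  # ['depression', 'bad', 'insurance']
print(only_positive)  # ['company', 'know', 'get']
```

With only ten words per group this is a rough comparison; passing a larger `top_n` to `most_common` would give the difference more to work with.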
