Importing data from Google Sheets

Adapted from the Colab Sheets snippet: https://colab.research.google.com/notebooks/snippets/sheets.ipynb#scrollTo=JiJVCmu3dhFa

In [1]:
from google.colab import auth
auth.authenticate_user()

import gspread
from google.auth import default
creds, _ = default()

gc = gspread.authorize(creds)

In [2]:
worksheet = gc.open('allSonyGlassdoor').sheet1

In [3]:
rows = worksheet.get_all_values()
print(rows)

[['title', 'author_info', 'rating', 'pros', 'cons', 'helpful'], ['Excellent company', 'Jun 1, 2022 - Service Engineer in Miami, FL', '5', 'Excellent company to work for', 'None to share, great environment to grow', 'Be the first to find this review helpful'], ['friendly environment', 'May 28, 2022 - Program Manager', '5', 'colleges are friendly and supportive. work-life balanced.', "There's not much room for young people to grow up", 'Be the first to find this review helpful'], ['Great Company', 'May 16, 2022 - Summer Intern in Herndon, VA', '5', 'I learned a lot in this company. The team is amazing and it is very supportive. a lot of flexibility.', 'Sometimes it is hard to make connections with other people.', 'Be the first to find this review helpful'], ['Sony review', 'May 11, 2022 - Anonymous Employee', '5', 'Good pay really solid culture', 'its not flexible depending on department', 'Be the first to find this review helpful'], ['no comments', 'May 24, 2022 - Senior Research Engine

Converting the spreadsheet to a Pandas Dataframe

In [4]:
import pandas as pd

In [5]:
sony_df = pd.DataFrame.from_records(rows)
display(sony_df)

Unnamed: 0,0,1,2,3,4,5
0,title,author_info,rating,pros,cons,helpful
1,Excellent company,"Jun 1, 2022 - Service Engineer in Miami, FL",5,Excellent company to work for,"None to share, great environment to grow",Be the first to find this review helpful
2,friendly environment,"May 28, 2022 - Program Manager",5,colleges are friendly and supportive. work-lif...,There's not much room for young people to grow up,Be the first to find this review helpful
3,Great Company,"May 16, 2022 - Summer Intern in Herndon, VA",5,I learned a lot in this company. The team is a...,Sometimes it is hard to make connections with ...,Be the first to find this review helpful
4,Sony review,"May 11, 2022 - Anonymous Employee",5,Good pay really solid culture,its not flexible depending on department,Be the first to find this review helpful
...,...,...,...,...,...,...
2003,Good company but there are better salaries and...,"Jun 4, 2009 - Software Engineer in Lund, Skåne",4,Lots of own responsibility. Managment in swede...,hard to correct bad salary and hard to advance...,1 person found this review helpful
2004,The Sun Sets on Sony,"Mar 20, 2009 - Director in Basingstoke, England",3,Lots of cheap toys in the staff shop Kind cult...,Long hours Inflexible Poor leaders Japanese ge...,Be the first to find this review helpful
2005,Sony Ericsson has amazing employees.,"Feb 24, 2009 - General Manager in Lund, Skåne",5,The people are extremely passionate about thei...,Sony uses it as either a place to send old exe...,Be the first to find this review helpful
2006,"It's an allright place to work, probably one o...","Dec 22, 2008 - Research Engineer",3,"Access to research resources seem good, not a ...",Some internal supervision is poor. Got hired a...,Be the first to find this review helpful


In [6]:
# Designating the first row of the dataframe as the header
sony_df.columns = sony_df.iloc[0]
sony_df = sony_df[1:]
sony_df.head()

Unnamed: 0,title,author_info,rating,pros,cons,helpful
1,Excellent company,"Jun 1, 2022 - Service Engineer in Miami, FL",5,Excellent company to work for,"None to share, great environment to grow",Be the first to find this review helpful
2,friendly environment,"May 28, 2022 - Program Manager",5,colleges are friendly and supportive. work-lif...,There's not much room for young people to grow up,Be the first to find this review helpful
3,Great Company,"May 16, 2022 - Summer Intern in Herndon, VA",5,I learned a lot in this company. The team is a...,Sometimes it is hard to make connections with ...,Be the first to find this review helpful
4,Sony review,"May 11, 2022 - Anonymous Employee",5,Good pay really solid culture,its not flexible depending on department,Be the first to find this review helpful
5,no comments,"May 24, 2022 - Senior Research Engineer",2,work with smart people work with cutting edge ...,toxic working environment low paid terrible WLB,Be the first to find this review helpful


Cleaning up the dataframe by...

- Removing the `helpful` (last) column, which records how many Glassdoor users rated a review as "helpful." This information is not relevant to us.
- Parsing the date from the `author_info` (second) column. Glassdoor does not require reviewers to state a job title, so one is not always present; when it is, we strip it out. This leaves only the information we need: the date the review was posted.
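As a compact illustration of the date-parsing step, here is a minimal sketch that uses only `datetime.strptime` and `strftime`. The name `parse_review_date` is hypothetical and not part of this notebook, and the sketch assumes English month abbreviations and that an entry without a job title ends in a dangling " -".

```python
from datetime import datetime

def parse_review_date(author_info):
    """Return the posting date in `author_info` as an MM/DD/YYYY string."""
    # Everything before the first " - " is the date. An entry without a job
    # title is assumed to end in a dangling " -", which strip() removes.
    date_part = author_info.split(" - ")[0].strip(" -")
    # "%b %d, %Y" matches strings such as "Jun 1, 2022".
    return datetime.strptime(date_part, "%b %d, %Y").strftime("%m/%d/%Y")

print(parse_review_date("Jun 1, 2022 - Service Engineer in Miami, FL"))
print(parse_review_date("May 28, 2022 -"))
```

Letting `strptime` parse the whole date at once also rejects malformed input with a `ValueError` instead of silently producing a wrong string.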

In [7]:
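# Remove the last column by its name.
# (Newer pandas versions no longer accept the axis as a positional argument,
# so we name the column with the `columns` keyword instead.)
sony_df = sony_df.drop(columns='helpful')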

  


In [8]:
import datetime

In [9]:
# Helper function for date formatting (MM/DD/YYYY).
# E.g., takes "Jan 1, 2000" as input and returns "01/01/2000" as output.
# Note that both input and output are strings.
def format_date(original_date):

  date_components = original_date.split(' ')

  # Convert the month from abbreviated to numerical format.
  # Pad zeroes wherever appropriate.
  month_published = str(datetime.datetime.strptime(date_components[0], "%b").month).zfill(2)

  # Remove the trailing comma from the day (second item in list).
  # Again, pad zeroes wherever appropriate.
  day_published = date_components[1][0:-1].zfill(2)

  year_published = date_components[2]

  date_formatted = month_published + '/' + day_published + '/' + year_published
  return date_formatted

In [10]:
sony_df.head()

Unnamed: 0,title,author_info,rating,pros,cons
1,Excellent company,"Jun 1, 2022 - Service Engineer in Miami, FL",5,Excellent company to work for,"None to share, great environment to grow"
2,friendly environment,"May 28, 2022 - Program Manager",5,colleges are friendly and supportive. work-lif...,There's not much room for young people to grow up
3,Great Company,"May 16, 2022 - Summer Intern in Herndon, VA",5,I learned a lot in this company. The team is a...,Sometimes it is hard to make connections with ...
4,Sony review,"May 11, 2022 - Anonymous Employee",5,Good pay really solid culture,its not flexible depending on department
5,no comments,"May 24, 2022 - Senior Research Engineer",2,work with smart people work with cutting edge ...,toxic working environment low paid terrible WLB


In [11]:
def format_dates_in_df(df_original):
  df = df_original.copy()

  # Extract the date from the `author_info` column
  for index, row in df.iterrows():

    delimiter = ' - '
    split_info = row['author_info'].split(delimiter)
    
    date_published = ''

    # If a job title was provided by the reviewer, we splice it out.
    if len(split_info) > 1:
      date_published = split_info[0]

    # If no job title was provided, then the date is simply
    # the first item in the list, with the trailing space & hyphen excluded.
    # So exclude the last two characters.
    else:
      date_published = split_info[0][0:-2]
    
    # Format the date, relying on the helper function above.
    date_formatted = format_date(date_published)
    
    # Update the dataframe.
    df.loc[index, 'author_info'] = date_formatted
  
  return df

In [12]:
formatted_sony_df = format_dates_in_df(sony_df)

In [13]:
formatted_sony_df.head()

Unnamed: 0,title,author_info,rating,pros,cons
1,Excellent company,06/01/2022,5,Excellent company to work for,"None to share, great environment to grow"
2,friendly environment,05/28/2022,5,colleges are friendly and supportive. work-lif...,There's not much room for young people to grow up
3,Great Company,05/16/2022,5,I learned a lot in this company. The team is a...,Sometimes it is hard to make connections with ...
4,Sony review,05/11/2022,5,Good pay really solid culture,its not flexible depending on department
5,no comments,05/24/2022,2,work with smart people work with cutting edge ...,toxic working environment low paid terrible WLB


### Retroactive date handling (pt. 1)

In [14]:
import nltk
nltk.download('punkt')
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from pprint import pprint

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [15]:
lemmatizer = WordNetLemmatizer()

def tokenizeLemmatize(reviews):
  temp = []
  for sentence in reviews:
    tokens = word_tokenize(sentence)
    cleanedSentence = ""
    for token in tokens:
        lemmetized_word = lemmatizer.lemmatize(token)
        cleanedSentence += lemmetized_word + " "
    temp.append(cleanedSentence)
  return temp

In [16]:
# Initializing a Python dictionary wherein
# keys: the tokenized, lemmatized review sentence
# values: the date (MM/DD/YYYY) that review was posted on

# Helper function
# Returns sentences as list
def parse_sentences_from_review_block(review):

  list_of_sentences = []
  
  current_sentence = ''
  # Start with no previous character; this also avoids an IndexError
  # when a review is an empty string.
  previous_char = ''
  
  for character in review:
    
    # Encounter period -> assume sentence
    if character == '.':
      if current_sentence.strip('-. ') and not current_sentence.strip(',. ').isspace():
        list_of_sentences.append(current_sentence.strip('-.'))
      # Reset
      current_sentence = ''
    
    # Encounter hyphen -> assume sentence
    elif character == ' ' and previous_char == '-':
      if current_sentence.strip('-. ') and not current_sentence.strip(',. ').isspace():
        list_of_sentences.append(current_sentence.strip('.-'))
      # Reset
      current_sentence = ''
    
    # Continue
    current_sentence += character
    previous_char = character
  
  # Append whatever's left, if it hasn't already been caught
  if current_sentence.strip('-. ') and not current_sentence.strip(',. ').isspace():
    list_of_sentences.append(current_sentence.strip('.-'))
  
  return list_of_sentences

# column_name is either 'pros' or 'cons'
def create_dates_dictionary(df, column_name):
  dates_dict = {}

  for index, row in df.iterrows():

    date_string = row['author_info']
    review = row[column_name]

    review_sentences = parse_sentences_from_review_block(review)
    cleaned_sentences = tokenizeLemmatize(review_sentences)
    # print(cleaned_sentences)

    for each_sentence in cleaned_sentences:
      dates_dict[each_sentence] = date_string
    
  return dates_dict

In [17]:
dates_dict_pros = create_dates_dictionary(formatted_sony_df, "pros")
print(dates_dict_pros)

{'Excellent company to work for ': '06/01/2022', 'college are friendly and supportive ': '05/28/2022', 'work-life balanced ': '05/28/2022', 'I learned a lot in this company ': '07/28/2016', 'The team is amazing and it is very supportive ': '05/16/2022', 'a lot of flexibility ': '05/16/2022', 'Good pay really solid culture ': '05/11/2022', 'work with smart people work with cutting edge technology ': '05/24/2022', 'A great place to work ': '02/03/2021', 'Salary Work hour Health Care ': '04/26/2022', 'Creative environment that empowers you ': '04/25/2022', 'Sony a a company is great ': '03/30/2022', 'Benefits are really good ': '03/30/2022', 'Salaries are good ': '03/30/2022', "Sony a a whole care about it 's employee ": '03/30/2022', 'very good company to work ': '04/21/2022', 'good working atmosphere on site ': '05/03/2022', 'Office space is very nice ': '05/02/2022', 'fast promotion with competitive compensation ': '04/11/2022', 'Fun time working there in Sony ': '04/06/2022', 'Culture

In [18]:
dates_dict_cons = create_dates_dictionary(formatted_sony_df, "cons")
print(dates_dict_cons)

{'None to share , great environment to grow ': '06/01/2022', "There 's not much room for young people to grow up ": '05/28/2022', 'Sometimes it is hard to make connection with other people ': '05/16/2022', 'it not flexible depending on department ': '05/11/2022', 'toxic working environment low paid terrible WLB ': '05/24/2022', 'I hate working at this place ! ': '05/04/2022', 'Benefits Culture Work hour Health Care ': '04/26/2022', 'Not many con to speak of ': '04/25/2022', 'GISD ( the security group for Sony ) is the worst organization to work for ': '03/30/2022', 'Upper management refuse to make decision then blame employee for lack of progress ': '03/30/2022', "Too many people in the org do n't have any understanding of security or good enough IT knowledge to work at this level ": '03/30/2022', 'Moral is low in most group ': '03/30/2022', 'If you are lucky enough to work in a group that actually ha high moral it may not last long ': '03/30/2022', 'food in the kitchen wa not good ': 
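One caveat about keying the dictionary by sentence text: if two reviews produce the same cleaned sentence, the later assignment overwrites the earlier date. This is presumably why the output above maps 'I learned a lot in this company ' to 07/28/2016, although the first review containing that sentence was posted on 05/16/2022. A minimal sketch of one way to keep every date is shown below; `build_dates_index` is a hypothetical helper, not something this notebook uses.

```python
from collections import defaultdict

def build_dates_index(sentence_date_pairs):
    """Map each sentence to the list of all dates it was posted on."""
    index = defaultdict(list)
    for sentence, date_string in sentence_date_pairs:
        # Appending keeps earlier dates instead of overwriting them.
        index[sentence].append(date_string)
    return index

# Sample pairs taken from the outputs above; the same sentence appears twice.
pairs = [
    ("I learned a lot in this company ", "05/16/2022"),
    ("a lot of flexibility ", "05/16/2022"),
    ("I learned a lot in this company ", "07/28/2016"),
]
index = build_dates_index(pairs)
print(index["I learned a lot in this company "])
```

With a list per sentence, a later lookup can decide explicitly which date to use (for example, the first unused one) rather than inheriting whichever review was processed last.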

# Zero-shot classification

Now, we're going to use zero-shot classification to classify our reviews
according to these axes: Culture and Values, Diversity and Inclusion, Work/Life Balance, Senior Management, Compensation and Benefits, and Career Opportunities. 

To do this, we'll first make a long list of all the sentences from our reviews. 

Then, we'll use BART from Hugging Face (https://huggingface.co/facebook/bart-large-mnli) to classify those sentences, putting them into appropriate lists! We will also keep them separated by negative and positive by assuming that whatever is under "pros" is positive, and whatever is under "cons" can be expected to be negative--this will be useful later on when we begin to use BERT for sentiment analysis.

In [20]:
pip install transformers

[pip output trimmed: installs transformers 4.19.2, huggingface-hub 0.7.0, tokenizers 0.12.1, and PyYAML 6.0]

In [21]:
from transformers import pipeline
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

In [22]:
reviewsPro = formatted_sony_df['pros']      # pros column
reviewsCon = formatted_sony_df['cons']      # cons column

Now that we have our data, we need to parse the reviews into sentences. This is a bit tricky: there's no unified format for user reviews. Some people use periods to separate sentences. Some use hyphens. Some use both. When the data is scraped from the Internet, newlines are lost, so we can't use those as clues.

Instead, we'll assume that a period ends a sentence. We'll also assume that a hyphen followed by a space ends a sentence. Exclamation marks are not treated as boundaries by the code in this notebook, so a sentence such as "I hate working at this place!" keeps its punctuation.
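For comparison, the same two boundary rules (a period, or a hyphen followed by whitespace) can be written as a regular expression. This is only a sketch of an alternative, not the approach this notebook takes: `split_review` is a hypothetical name, and unlike the character loop it also trims surrounding whitespace from each sentence.

```python
import re

def split_review(review):
    """Split a review on periods and on hyphens followed by whitespace."""
    # "\." matches a period; "-\s" matches a hyphen followed by whitespace,
    # so hyphenated words such as "work-life" are left intact.
    parts = re.split(r"\.|-\s", review)
    # Drop fragments that are empty once punctuation and spaces are removed.
    return [part.strip(" -.") for part in parts if part.strip(" -.,")]

print(split_review("colleges are friendly and supportive. work-life balanced."))
print(split_review("- Good pay - nice people"))
```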

In [23]:
proSentences = []
conSentences = []

def parseSentencesFromReviewsColumn(reviewsColumn, reviewsList):
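  # Iterate through the review at each row in the given column.
  # (Series.items() replaces iteritems(), which was removed in pandas 2.0.)
  for (rowIndex, columnData) in reviewsColumn.items():
    currSent = ""
    # Start with no previous character; this also avoids an IndexError
    # on an empty review.
    prevChar = ""
    # columnData = one full review section. Loop thru char by char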
    for character in columnData:
      # If we encounter a period, assume sentence, but only if the built string is
      # not spaces and not empty.
      if character == '.':
        if currSent.strip("-. ") and not currSent.strip(",. ").isspace():
          reviewsList.append(currSent.strip("-."))
        currSent = ""
      # If we encounter a "- ", assume sentence.
      elif character == ' ' and prevChar == '-':
        if currSent.strip("-. ") and not currSent.strip(",. ").isspace():
          reviewsList.append(currSent.strip(".-"))
      
        currSent = ""
    
      # Keep building the string and keep track of the prev char.
      currSent += character
      prevChar = character
      
    # Append whatever was left if it wasn't already caught
    if currSent.strip("-. ") and not currSent.strip(",. ").isspace():
      reviewsList.append(currSent.strip(".-"))

# Run on positive and negative reviews
parseSentencesFromReviewsColumn(reviewsPro, proSentences)
parseSentencesFromReviewsColumn(reviewsCon, conSentences)

# Sanity check
print(proSentences[:15])
print(conSentences[:15])

['Excellent company to work for', 'colleges are friendly and supportive', ' work-life balanced', 'I learned a lot in this company', ' The team is amazing and it is very supportive', ' a lot of flexibility', 'Good pay really solid culture', 'work with smart people work with cutting edge technology', 'A great place to work', 'Salary Work hours Health Care', 'Creative environment that empowers you', 'Sony as a company is great', ' Benefits are really good', ' Salaries are good', " Sony as a whole cares about it's employees"]
['None to share, great environment to grow', "There's not much room for young people to grow up", 'Sometimes it is hard to make connections with other people', 'its not flexible depending on department', 'toxic working environment low paid terrible WLB', 'I hate working at this place!', 'Benefits Culture Work hours Health Care', 'Not many cons to speak of', 'GISD (the security group for Sony) is the worst organization to work for', ' Upper management refuses to make d

In [24]:
lemmatizer = WordNetLemmatizer()

def tokenizeLemmetize(reviews):
  temp = []
  for sentence in reviews:
    tokens = word_tokenize(sentence)
    cleanedSentence = ""
    for token in tokens:
        lemmetized_word = lemmatizer.lemmatize(token)
        cleanedSentence += lemmetized_word + " "
    temp.append(cleanedSentence)
  return temp

In [25]:
proSentencesCleaned = tokenizeLemmetize(proSentences)
conSentencesCleaned = tokenizeLemmetize(conSentences)

# Sanity check
print(proSentencesCleaned[:15])
print(conSentencesCleaned[:15])

['Excellent company to work for ', 'college are friendly and supportive ', 'work-life balanced ', 'I learned a lot in this company ', 'The team is amazing and it is very supportive ', 'a lot of flexibility ', 'Good pay really solid culture ', 'work with smart people work with cutting edge technology ', 'A great place to work ', 'Salary Work hour Health Care ', 'Creative environment that empowers you ', 'Sony a a company is great ', 'Benefits are really good ', 'Salaries are good ', "Sony a a whole care about it 's employee "]
['None to share , great environment to grow ', "There 's not much room for young people to grow up ", 'Sometimes it is hard to make connection with other people ', 'it not flexible depending on department ', 'toxic working environment low paid terrible WLB ', 'I hate working at this place ! ', 'Benefits Culture Work hour Health Care ', 'Not many con to speak of ', 'GISD ( the security group for Sony ) is the worst organization to work for ', 'Upper management refu

Now we're all set up to classify our sentences. We'll sort them into lists according to their valence and category (the six categories are given under "Zero-shot classification"), for 12 lists in total.

In [26]:
# These are the possible categories of relevance we have defined.
# Diversity and inclusion = 1
# Culture and values = 2
# Work life balance = 3
# Senior management = 4
# Career opportunities = 5
# Compensation and benefits = 6
candidate_labels = ['diversity and inclusion', 'culture and values', 'work life balance', 'senior management', 'career opportunities', 'compensation and benefits']
pro1 = []
con1 = []
pro2 = []
con2 = []
pro3 = []
con3 = []
pro4 = []
con4 = []
pro5 = []
con5 = []
pro6 = []
con6 = []

pros = [pro1, pro2, pro3, pro4, pro5, pro6]
cons = [con1, con2, con3, con4, con5, con6]

# Let's be picky and assume that if the top value is lower than 0.4, the
# sentence is not relevant.

def sortReviewSentencesUsingZeroShot(sentenceList, labeledContainers):
  for sentence in sentenceList:
    cat = classifier(sentence, candidate_labels)
    # The pipeline returns labels sorted by descending score, so index 0
    # is the top prediction.
    if float(cat['scores'][0]) > 0.4:
      label = cat['labels'][0]
      # Append to the container at the same position as the winning label.
      labeledContainers[candidate_labels.index(label)].append(sentence)
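To make the thresholding logic easier to inspect without loading the model, here is a small sketch that works on hand-written stand-ins for the classifier's output. It assumes the result shape the zero-shot pipeline returns (a dict with `sequence`, `labels`, and `scores`, with labels sorted by descending score); `bucket_by_top_label` is a hypothetical helper and the scores are made up for illustration.

```python
def bucket_by_top_label(results, labels, threshold=0.4):
    """Group sequences by their top label, skipping low-confidence results."""
    buckets = {label: [] for label in labels}
    for result in results:
        # Index 0 holds the highest-scoring label and its score.
        top_label, top_score = result["labels"][0], result["scores"][0]
        if top_score > threshold:
            buckets[top_label].append(result["sequence"])
    return buckets

labels = ["work life balance", "compensation and benefits", "senior management"]

# Hand-written stand-ins for classifier output; these scores are invented.
fake_results = [
    {"sequence": "Salaries are good ",
     "labels": ["compensation and benefits", "work life balance", "senior management"],
     "scores": [0.90, 0.06, 0.04]},
    {"sequence": "A great place to work ",
     "labels": ["work life balance", "senior management", "compensation and benefits"],
     "scores": [0.38, 0.33, 0.29]},
]

buckets = bucket_by_top_label(fake_results, labels)
print(buckets)
```

The second sentence is dropped because its top score falls below the 0.4 cutoff, which is the behavior we want for sentences that fit none of the categories well.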

In [27]:
# Libraries needed to import/export files from/to drive
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [28]:
# Each list item is written on a separate line: lists within the list are 
# separated with the token "[LISTSEP]". For the filepath, you need to 
# input a directory that already exists in your drive. (e.g., 
# /content/drive/MyDrive/folderYouCreated/fileNameYouWant)

def writeListOfListsToFile(listThingy, filePath):
  with open(filePath, 'w') as writefile:
    for oneList in listThingy:
      for element in oneList:
        writefile.write(element)
        writefile.write('\n')
      writefile.write("[LISTSEP]\n")
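Reading the file back is the mirror image of writing it. The sketch below is a hypothetical counterpart (`read_list_of_lists` is not defined elsewhere in this notebook); it assumes that no sentence contains a newline or consists solely of the separator token.

```python
import io

LIST_SEPARATOR = "[LISTSEP]"

def read_list_of_lists(lines):
    """Rebuild the list of lists from lines in the [LISTSEP] format."""
    groups, current = [], []
    for line in lines:
        text = line.rstrip("\n")
        if text == LIST_SEPARATOR:
            # A separator closes the current list, even if it is empty.
            groups.append(current)
            current = []
        else:
            current.append(text)
    return groups

# An in-memory stand-in for a file written in the format described above.
sample = io.StringIO(
    "Salaries are good \n"
    "[LISTSEP]\n"
    "[LISTSEP]\n"
    "A great place to work \n"
    "Office space is very nice \n"
    "[LISTSEP]\n"
)
groups = read_list_of_lists(sample)
print(groups)
```

Because every list is followed by a separator, empty categories survive the round trip as empty lists, so the position of each list still identifies its category.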

In [29]:
sortReviewSentencesUsingZeroShot(proSentencesCleaned, pros)
writeListOfListsToFile(pros, '/content/drive/MyDrive/new_sony/sonyProsClassified_FIXED.txt')

In [None]:
sortReviewSentencesUsingZeroShot(conSentencesCleaned, cons)
writeListOfListsToFile(cons, '/content/drive/MyDrive/new_sony/sonyConsClassified_FIXED.txt')

Now we'll save the raw classifier output for the first few sentences to a file, to be used for our confusion matrix.

In [None]:
def printClassifiersForConfusion(sentenceList, howMany, filePath):
  # Write the full classifier output (labels and scores) for the first
  # `howMany` sentences, one result per line.
  with open(filePath, 'w') as writefile:
    for i in range(howMany):
      writefile.write(str(classifier(sentenceList[i], candidate_labels)))
      writefile.write("\n")

In [None]:
printClassifiersForConfusion(proSentencesCleaned, 25, '/content/drive/MyDrive/new_sony/sonyPosConfusion_FIXED.txt')

In [None]:
printClassifiersForConfusion(conSentencesCleaned, 25, '/content/drive/MyDrive/new_sony/sonyNegConfusion_FIXED.txt')
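Once each stored sentence also has a hand-assigned category, the confusion matrix itself is just a tally of (true label, predicted label) pairs. The sketch below shows that tally under the assumption that such manual labels exist; `confusion_counts` and the sample labels are hypothetical and are not results from this notebook.

```python
from collections import Counter

def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    return Counter(zip(true_labels, predicted_labels))

# Made-up labels for three sentences, for illustration only.
true_labels = ["culture and values", "work life balance", "work life balance"]
predicted = ["culture and values", "culture and values", "work life balance"]

counts = confusion_counts(true_labels, predicted)
for (true_label, predicted_label), n in sorted(counts.items()):
    print(f"{true_label} -> {predicted_label}: {n}")
```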

### Retroactive date handling (pt. 2)

Producing a "parallel" text file of dates that, line-by-line, matches each sentence from the classification file with its authorship date.

In [21]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [22]:
category_delimiter = "[LISTSEP]\n"

def match_dates(filename_read, dates_dict):

  # Read the file of classified review sentences.
  with open(filename_read, "r") as f_read:
    classified_lines = f_read.readlines()

  dates_in_classification_order = []

  for line in classified_lines:
    if line != category_delimiter:

      # Remove the trailing newline and any leading spaces. The trailing
      # space is kept because the dictionary keys end with one.
      line_stripped = line.rstrip('\n').lstrip(' ')

      # Raises KeyError if the sentence is missing from dates_dict.
      date_string = dates_dict[line_stripped]
      dates_in_classification_order.append(date_string)
    else:
      dates_in_classification_order.append(category_delimiter)

  return dates_in_classification_order

In [33]:
dates_pros = match_dates('/content/drive/MyDrive/new_sony/sonyProsClassified_FIXED.txt', dates_dict_pros)

In [34]:
print(dates_pros)

['03/04/2022', '10/25/2021', '02/10/2021', '01/24/2021', '12/30/2020', '11/10/2020', '03/11/2020', '11/09/2019', '05/09/2019', '04/13/2019', '12/03/2017', '02/05/2016', '12/26/2015', '12/07/2015', '11/17/2015', '10/16/2015', '03/28/2015', '03/04/2015', '01/27/2015', '01/08/2015', '05/31/2013', '09/17/2010', '05/02/2010', '05/22/2022', '09/19/2021', '08/30/2021', '08/05/2021', '07/09/2021', '03/21/2021', '03/10/2021', '01/02/2021', '09/17/2020', '06/24/2020', '06/02/2020', '07/13/2019', '05/12/2019', '04/01/2019', '03/11/2019', '11/13/2018', '06/10/2018', '03/16/2018', '09/12/2017', '08/15/2017', '05/22/2017', '04/04/2017', '01/24/2017', '01/24/2017', '01/21/2017', '12/12/2016', '09/22/2016', '09/19/2016', '07/19/2016', '07/15/2016', '04/21/2016', '04/18/2016', '01/17/2016', '06/10/2015', '07/19/2014', '05/21/2014', '01/07/2014', '10/30/2013', '09/05/2013', '06/24/2013', '06/24/2013', '06/22/2013', '03/07/2013', '10/20/2012', '09/19/2012', '09/23/2012', '08/25/2012', '08/22/2012', '08/0

In [23]:
dates_cons = match_dates('/content/drive/MyDrive/new_sony/sonyConsClassified.txt', dates_dict_cons)

In [24]:
print(dates_cons)

['09/29/2020', '02/25/2016', '11/19/2015', '06/08/2015', '05/02/2010', '10/23/2021', '07/20/2021', '03/23/2021', '11/04/2020', '06/01/2020', '07/23/2012', '12/06/2011', '10/15/2011', '[LISTSEP]\n', '03/30/2022', '03/30/2022', '04/04/2022', '02/24/2022', '02/12/2022', '12/29/2021', '12/05/2021', '08/23/2021', '06/21/2021', '03/04/2021', '10/14/2020', '08/08/2020', '05/13/2020', '05/13/2020', '04/07/2020', '04/04/2020', '04/04/2020', '02/04/2020', '01/13/2020', '11/04/2019', '08/12/2019', '05/31/2019', '05/31/2019', '04/13/2019', '03/07/2019', '02/13/2019', '09/27/2018', '09/27/2018', '09/27/2018', '09/06/2018', '05/22/2018', '03/19/2018', '01/26/2018', '12/05/2017', '05/20/2017', '04/26/2017', '05/13/2017', '04/10/2017', '02/14/2017', '02/10/2017', '11/29/2016', '09/19/2016', '07/18/2016', '07/18/2016', '07/12/2016', '04/25/2016', '04/02/2016', '12/01/2015', '09/02/2015', '08/26/2015', '07/29/2015', '06/17/2015', '06/01/2015', '06/11/2015', '06/11/2015', '05/01/2015', '04/25/2015', '03/

In [25]:
# Input:
#   (1) filepath
#   (2) resulting list of classification-ordered dates from match_dates
# Output:
#   Does not return anything, but writes dates to specified file(path).
def write_dates_to_file(filename_write, dates_ordered):

  with open(filename_write, "w") as f_write:
    for date_string in dates_ordered:
      # The [LISTSEP] delimiter already ends with a newline, so don't add one.
      if date_string[0] == "[":
        f_write.write("%s" % date_string)
      else:
        f_write.write("%s\n" % date_string)

In [38]:
# Pros
write_dates_to_file('/content/drive/MyDrive/dates/sonyPosDates.txt', dates_pros)

In [26]:
# Cons
write_dates_to_file('/content/drive/MyDrive/dates/sonyNegDates.txt', dates_cons)
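Because the dates file mirrors the classification file line by line, the two can later be zipped back together. Here is a minimal sketch of that step, operating on lists of lines rather than on the Drive files; `pair_sentences_with_dates` is a hypothetical helper, and the sample lines are assembled from values shown earlier in this notebook.

```python
LIST_SEPARATOR = "[LISTSEP]"

def pair_sentences_with_dates(sentence_lines, date_lines):
    """Pair each sentence with its date, grouped by category."""
    if len(sentence_lines) != len(date_lines):
        raise ValueError("the two files are not parallel")
    categories, current = [], []
    for sentence, date_string in zip(sentence_lines, date_lines):
        sentence = sentence.rstrip("\n")
        date_string = date_string.rstrip("\n")
        if sentence == LIST_SEPARATOR:
            # The separator appears on the same line in both files.
            categories.append(current)
            current = []
        else:
            current.append((sentence, date_string))
    return categories

sentence_lines = ["Salaries are good \n", "[LISTSEP]\n",
                  "A great place to work \n", "[LISTSEP]\n"]
date_lines = ["03/30/2022\n", "[LISTSEP]\n", "02/03/2021\n", "[LISTSEP]\n"]

paired = pair_sentences_with_dates(sentence_lines, date_lines)
print(paired)
```

The length check catches the most likely failure, a dates file produced from a different classification run than the sentences file.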