# Citation Counting

@authors: Zekai Fan, UC Berkeley; Deepak Ragu, UC Berkeley; Jaren Haber, PhD, Dartmouth College <br>
@PI: Prof. Heather Haveman, UC Berkeley <br>
@date_modified: December 2022 <br>
@contact: jhaber@berkeley.edu <br>
@inputs: list of authors <br>
@outputs: count of authors (citation_and_expanded_dict_count_{thisday}.csv), where 'thisday' is today's date in mmddyy format <br>
@description: Counts references to canonical authors in articles by ngrams. This guide helps users update the word lists and get new counts for new authors by outlining and explaining key code sections. <br>

## Setup

First, we will ensure that all the necessary tools are installed and imported.

In [1]:
#!pip install tqdm
#!pip install tables
from tqdm.notebook import tqdm

In [2]:
import os
import csv
import pandas as pd
from datetime import date
thisday = date.today().strftime("%m%d%y")
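
The `"%m%d%y"` format used for `thisday` produces a six-digit stamp: two-digit month, day, and year. A minimal sketch with a fixed date (chosen only for illustration, not taken from the project) shows the layout:

```python
from datetime import date

# Hypothetical fixed date, used only to show the "%m%d%y" layout.
example_day = date(2022, 12, 5)
stamp = example_day.strftime("%m%d%y")
print(stamp)  # 120522
```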

Next, we will set up the filepaths needed to read our JSTOR data. First, we define the JSTOR home data filepath. Next, we build the list of ngram file names by reading in filtered_length_index.csv. Finally, we read in the expanded dictionary files for the cultural, relational, and demographic sets.

In [3]:
JSTOR_HOME = "../../jstor_data"

In [4]:
INDICES = "../article_data/filtered_length_index.csv"

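with open(INDICES, 'r') as f:
    files = f.read().split('\n')[:-1] # drop the empty string after the final newline
    files = [fp.split(',')[1] for fp in files[:500]] # keep the second column (article ID) of the first 500 rows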
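
As a hedged alternative to splitting lines by hand, the same step can be sketched with the standard `csv` module. The two-row sample, the assumed column layout (row number, then article ID), and the helper name `read_article_ids` are illustrative assumptions, not part of the project code:

```python
import csv
import io

# Hypothetical two-row index; the real file's column layout is assumed
# to be (row number, article ID), matching the use of column 1 above.
sample_index = io.StringIO(
    "0,journal-article-10.2307_2065002\n"
    "1,journal-article-10.2307_3380821\n"
)

def read_article_ids(handle, limit=None):
    """Return the second column of each non-empty row, up to `limit` rows."""
    ids = [row[1] for row in csv.reader(handle) if row]
    return ids if limit is None else ids[:limit]

sample_files = read_article_ids(sample_index, limit=500)
print(sample_files)
```

Using `csv.reader` also copes with quoted fields that contain commas, which a plain `split(',')` would break apart.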

In [5]:
# expanded_dict_folder = "../Dictionaries/Expanded/wordnet_english2/"
expanded_dict_folder = "../dictionaries/expanded/"
full_cultural = expanded_dict_folder + "closest_culture_1000.csv"
full_relational = expanded_dict_folder + "closest_relational_1000.csv"
full_demographic = expanded_dict_folder + "closest_demographic_1000.csv"
full_cultural_set = set()
full_relational_set = set()
full_demographic_set = set()

csv_lst = [full_cultural, full_demographic, full_relational]
set_lst = [full_cultural_set, full_demographic_set, full_relational_set]

for i in range(3):
    with open(csv_lst[i], 'r') as f:
        reader = csv.reader(f)
        for line in reader:
            set_lst[i].add(' '.join(line[0].split('_')))

        
all_terms = set.union(full_cultural_set, full_relational_set, full_demographic_set)

In [6]:
# Show dictionaries
[list(termlist)[:10] for termlist in set_lst]

[['gave rise',
  'superseded',
  'vulgar',
  'imagines',
  'everyday',
  'argues',
  'representations',
  'imagined',
  'deconstructionist',
  'totalitarian'],
 ['font libraries',
  'dedicated biotechnology',
  'enjoyed netscape',
  'taper integration',
  'blade server',
  'cosmetic speculative',
  'breweries carroll',
  'convergence upheaval',
  'expropriation bonding',
  'mover disadvantages'],
 ['font libraries',
  'interdependences',
  'owner managed',
  'dedicated biotechnology',
  'executive compensation',
  'taper integration',
  'cash flow',
  'portfolio restructuring',
  'deeds hill',
  'opportunism']]
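
In the sample output above, several terms (for example 'font libraries') appear in more than one list. Because the counting code sums each set separately, a shared term contributes to every perspective that contains it. A minimal sketch for checking such overlaps, using toy sets and a hypothetical helper `pairwise_overlap` rather than the real dictionaries:

```python
from itertools import combinations

# Toy term sets standing in for the three expanded dictionaries.
toy_sets = {
    "cultural": {"everyday", "imagined"},
    "demographic": {"font libraries", "blade server"},
    "relational": {"font libraries", "cash flow"},
}

def pairwise_overlap(named_sets):
    """Map each pair of set names to the terms the two sets share."""
    return {
        (a, b): named_sets[a] & named_sets[b]
        for a, b in combinations(sorted(named_sets), 2)
    }

overlaps = pairwise_overlap(toy_sets)
for pair, shared in overlaps.items():
    print(pair, sorted(shared))
```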

## Update Words

Here, we'll create lists of the authors we want to find in article citations, categorizing them as demographic, relational, or cultural. Edit the Python lists in the following code snippet by adding or removing authors to get new or updated counts for specific authors.

### Capture author citations

In [7]:
cultural_authors = ['meyer rowan', 'dimaggio powell', 'powell dimaggio', 'oliver', 'powell', 'scott', 'weick']

demographic_authors = ['hannan freeman', 'barnett carroll', 'barron west', 'brüderl schüssler', 'carroll hannan', 
                       'freeman carroll', 'fichman levinthal', 'carroll']

relational_authors = ['pfeffer salancik', 'burt christman', 'pfeffer nowak', 'pfeffer']

ALL_AUTHORS = set(demographic_authors + relational_authors + cultural_authors)

author_types_list = [cultural_authors, demographic_authors, relational_authors]

### Capture foundational terms (using dictionaries)

In [8]:
# Load original dictionaries (one entry per line; commas within an entry become spaces)
def load_original_dict(path):
    # Read line by line: newer pandas versions raise an error when '\n' is passed as the read_csv delimiter
    with open(path, 'r') as f:
        return [line.strip().replace(',', ' ') for line in f if line.strip()]

# Note: these assignments replace the author lists defined in the previous cell
cultural_authors = load_original_dict('../dictionaries/original/cultural_original.csv')
demographic_authors = load_original_dict('../dictionaries/original/demographic_original.csv')
relational_authors = load_original_dict('../dictionaries/original/relational_original.csv')

ALL_AUTHORS = set(cultural_authors + demographic_authors + relational_authors) # full list of dictionaries

author_types_list = [cultural_authors, demographic_authors, relational_authors]

## Word Counts

Using the ngram.txt file names we previously read in from filtered_length_index.csv, we will now parse the ngram files and store, for each JSTOR article, the counts of the authors and dictionary terms it mentions. After building this dataframe, we can look at the total counts for each category (i.e., cultural, demographic, relational).

First, we'll create our dataframe that will contain our final information, as well as a list containing the names of the various perspectives we use when collecting ngram counts.

In [9]:
counts_df = pd.DataFrame(columns=["article_id", "cultural_author_count", "demographic_author_count", "relational_author_count",
                                 "cultural_count3", "demographic_count3", "relational_count3", 
                                 "cultural_count2", "demographic_count2", "relational_count2", 
                                 "cultural_count1", "demographic_count1", "relational_count1"]) 

perspective_types = ["cultural", "demographic", "relational"]

In [10]:
def generate_ngram_counts(ngram_value, counts_df):
    '''Generates ngram counts by parsing JSTOR ngram files and storing, for each article, the counts of 
    the author names and dictionary terms it contains. 
    
    Args: 
        ngram_value (int): how many words to count: 1 = unigram, 2 = bigram, 3 = trigram
        counts_df (pd.DataFrame): the dataframe to update
    
    Returns:
        pd.DataFrame: counts_df with one new row per article'''
    
    folder = os.path.join(JSTOR_HOME, 'ngram{}'.format(ngram_value))
    rows = [] # one dict per article; combined into a dataframe after the loop

    for file in tqdm(files):
        with open(os.path.join(folder, '{}-ngram{}.txt'.format(file, ngram_value)), 'r') as f:

            d = {} # ngram -> count, keeping only tracked authors and terms

            for line in f.read().splitlines():
                k, v = line.split('\t')
                if k in ALL_AUTHORS or k in all_terms:
                    d[k] = int(v)
                    
            author_sums = [sum([d.get(author, 0) for author in author_list]) for author_list in author_types_list]
            term_sums = [sum([d.get(term, 0) for term in set_type]) for set_type in set_lst]

            row = {"article_id": file}
            for i in range(3):
                row[perspective_types[i] + "_author_count"] = author_sums[i]
                row[perspective_types[i] + "_count" + str(ngram_value)] = term_sums[i]

            rows.append(row)
            
    # DataFrame.append was removed in pandas 2.0, so add all rows with a single concat
    return pd.concat([counts_df, pd.DataFrame(rows)], ignore_index=True)
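
To make the parsing step concrete, here is a minimal sketch of the same logic on a handful of in-memory lines instead of a real ngram file. The sample lines and toy lists are invented for illustration; only the tab-separated `ngram<TAB>count` layout is taken from the function above:

```python
# Hypothetical contents of one bigram file: "<ngram>\t<count>" per line.
sample_lines = [
    "hannan freeman\t3",
    "pfeffer salancik\t2",
    "cash flow\t5",
    "the firm\t40",
]

# Toy stand-ins, ordered cultural, demographic, relational.
toy_authors = [["meyer rowan"], ["hannan freeman"], ["pfeffer salancik"]]
toy_terms = [{"everyday"}, {"blade server"}, {"cash flow"}]
wanted = set().union(*toy_authors, *toy_terms)

# Keep only tracked ngrams, as the function does with ALL_AUTHORS and all_terms.
counts = {}
for line in sample_lines:
    ngram, count = line.split("\t")
    if ngram in wanted:
        counts[ngram] = int(count)

author_sums = [sum(counts.get(a, 0) for a in group) for group in toy_authors]
term_sums = [sum(counts.get(t, 0) for t in group) for group in toy_terms]
print(author_sums, term_sums)  # [0, 3, 2] [0, 0, 5]
```

Untracked ngrams such as "the firm" are dropped before summing, so they never affect the totals.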

### Trigrams (3 words)

In [11]:
counts_df = generate_ngram_counts(3, counts_df)
counts_df.head()

Unnamed: 0,article_id,cultural_author_count,demographic_author_count,relational_author_count,cultural_count3,demographic_count3,relational_count3,cultural_count2,demographic_count2,relational_count2,cultural_count1,demographic_count1,relational_count1
0,journal-article-10.2307_2065002,0,0,0,0,0,0,,,,,,
1,journal-article-10.2307_3380821,0,0,0,0,0,0,,,,,,
2,journal-article-10.2307_2095822,0,0,0,0,0,0,,,,,,
3,journal-article-10.2307_40836133,0,0,0,0,0,0,,,,,,
4,journal-article-10.2307_2579666,0,0,0,0,0,0,,,,,,


In [12]:
counts_df.sum()

article_id                  journal-article-10.2307_2065002journal-article...
cultural_author_count                                                       0
demographic_author_count                                                    2
relational_author_count                                                     0
cultural_count3                                                             0
demographic_count3                                                          0
relational_count3                                                           0
cultural_count2                                                             0
demographic_count2                                                          0
relational_count2                                                           0
cultural_count1                                                             0
demographic_count1                                                          0
relational_count1                                               

### Bigrams (2 words)

In [13]:
counts_df = generate_ngram_counts(2, counts_df)
counts_df.head()

Unnamed: 0,article_id,cultural_author_count,demographic_author_count,relational_author_count,cultural_count3,demographic_count3,relational_count3,cultural_count2,demographic_count2,relational_count2,cultural_count1,demographic_count1,relational_count1
0,journal-article-10.2307_2065002,0,0,0,0,0,0,,,,,,
1,journal-article-10.2307_3380821,0,0,0,0,0,0,,,,,,
2,journal-article-10.2307_2095822,0,0,0,0,0,0,,,,,,
3,journal-article-10.2307_40836133,0,0,0,0,0,0,,,,,,
4,journal-article-10.2307_2579666,0,0,0,0,0,0,,,,,,


In [14]:
counts_df.sum()

article_id                  journal-article-10.2307_2065002journal-article...
cultural_author_count                                                     170
demographic_author_count                                                  558
relational_author_count                                                   735
cultural_count3                                                             0
demographic_count3                                                          0
relational_count3                                                           0
cultural_count2                                                           633
demographic_count2                                                       2488
relational_count2                                                        3591
cultural_count1                                                             0
demographic_count1                                                          0
relational_count1                                               

### Unigrams (1 word)

In [15]:
counts_df = generate_ngram_counts(1, counts_df)
counts_df.head()

Unnamed: 0,article_id,cultural_author_count,demographic_author_count,relational_author_count,cultural_count3,demographic_count3,relational_count3,cultural_count2,demographic_count2,relational_count2,cultural_count1,demographic_count1,relational_count1
0,journal-article-10.2307_2065002,0,0,0,0,0,0,,,,,,
1,journal-article-10.2307_3380821,0,0,0,0,0,0,,,,,,
2,journal-article-10.2307_2095822,0,0,0,0,0,0,,,,,,
3,journal-article-10.2307_40836133,0,0,0,0,0,0,,,,,,
4,journal-article-10.2307_2579666,0,0,0,0,0,0,,,,,,


In [16]:
counts_df.sum()

article_id                  journal-article-10.2307_2065002journal-article...
cultural_author_count                                                   12839
demographic_author_count                                                10344
relational_author_count                                                 31182
cultural_count3                                                             0
demographic_count3                                                          0
relational_count3                                                           0
cultural_count2                                                           633
demographic_count2                                                       2488
relational_count2                                                        3591
cultural_count1                                                         63307
demographic_count1                                                      40319
relational_count1                                               

## Wrap-up + storage

The method above gives each article three separate rows, one for each ngram size (1, 2, and 3). To save space, let's condense each article into a single row using a `groupby()` in pandas. Summing within each article also adds up the author counts from the three passes.

In [17]:
print(counts_df.shape) # number of rows, cols
print(counts_df.nunique()) # number of unique values in each column

(1500, 13)
article_id                  500
cultural_author_count        93
demographic_author_count     78
relational_author_count     158
cultural_count3               1
demographic_count3            1
relational_count3             1
cultural_count2              16
demographic_count2           42
relational_count2            54
cultural_count1             240
demographic_count1          178
relational_count1           232
dtype: int64


In [18]:
counts_df = counts_df.groupby('article_id').sum().reset_index(drop = False) # collapse so one article = one row
counts_df.shape

(500, 13)
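
The collapse can be illustrated on a toy frame. This is a sketch with invented values and only a subset of the columns; it relies on the pandas default that `sum()` skips missing values, so the empty cells left by the other ngram passes do not affect the totals:

```python
import numpy as np
import pandas as pd

# Two hypothetical articles, three rows each (trigram, bigram, unigram pass).
toy = pd.DataFrame({
    "article_id": ["a1", "a1", "a1", "a2", "a2", "a2"],
    "cultural_author_count": [0, 1, 4, 0, 0, 2],
    "cultural_count3": [0, np.nan, np.nan, 1, np.nan, np.nan],
    "cultural_count2": [np.nan, 2, np.nan, np.nan, 0, np.nan],
    "cultural_count1": [np.nan, np.nan, 7, np.nan, np.nan, 3],
})

# One row per article; NaN cells are ignored by the sum.
collapsed = toy.groupby("article_id").sum().reset_index()
print(collapsed)
```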

We can now write our dataframe to a CSV file for later parsing and use. Uncomment the line below to write to `citation_and_expanded_dict_count_{thisday}.csv`, or change the file name passed in to write to a different file.

In [19]:
#counts_df.to_csv(f'citation_and_expanded_dict_count_{thisday}.csv', index=True)
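
One detail worth knowing when choosing the `index` argument: a CSV written with `index=True` gains an extra unnamed first column, which pandas labels `Unnamed: 0` when the file is read back without `index_col`. This is likely the source of the `Unnamed: 0` column in the previews above. A minimal sketch with a toy frame and in-memory buffers (no files are written):

```python
import io
import pandas as pd

toy_counts = pd.DataFrame({"article_id": ["a1", "a2"], "cultural_count1": [7, 3]})

# Round trip with the index written out.
with_index = io.StringIO()
toy_counts.to_csv(with_index, index=True)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# Round trip without the index.
without_index = io.StringIO()
toy_counts.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

If the row index carries no information, as here after `reset_index()`, passing `index=False` keeps the saved file to the thirteen data columns.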