# Citation Counting

This notebook counts references to canonical authors in articles, using unigram and bigram counts. This guide outlines and explains the key code sections so that users can update the author lists and generate new counts for new authors.

## Setup

First, we will ensure we have all the necessary tools installed and imported.

In [40]:
!pip install tqdm
!pip install tables
from tqdm.notebook import tqdm



In [41]:
import os
import csv
import pandas as pd

Next, we will set up the filepaths needed to read our JSTOR data. First, we define the JSTOR home data directory. Then we read the article identifiers from filtered_index.csv, which are later used to locate each article's ngram .txt file. Finally, we read in the expanded dictionary files for the cultural, relational, and demographic term sets.

In [42]:
JSTOR_HOME = "../../jstor_data"

In [43]:
INDICES = "./filtered_index.csv"

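with open(INDICES, 'r') as f:
    # One article identifier per line; skip blank lines (e.g. a trailing newline).
    files = [line for line in f.read().splitlines() if line]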
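
As a rough illustration of how each index entry maps to an ngram file, the sketch below builds a path the same way the counting function later in this notebook does. The helper name `ngram_path` and the sample article ID are hypothetical; only the `ngram{n}` folder and the `{id}-ngram{n}.txt` naming come from this notebook.

```python
import os

JSTOR_HOME = "../../jstor_data"


def ngram_path(article_id, ngram_value, home=JSTOR_HOME):
    """Hypothetical helper: path to one article's ngram file."""
    folder = os.path.join(home, "ngram{}".format(ngram_value))
    return os.path.join(folder, "{}-ngram{}.txt".format(article_id, ngram_value))


# "journal-article-10.2307_123456" is a made-up ID used only for illustration.
example = ngram_path("journal-article-10.2307_123456", 2)
print(example)
```

Checking a few of these paths with `os.path.exists` before a long run may help catch a wrong `JSTOR_HOME` early.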

In [44]:
# expanded_dict_folder = "../Dictionaries/Expanded/wordnet_english2/"
expanded_dict_folder = "../dictionaries/expanded/"
full_cultural = expanded_dict_folder + "closest_culture_1000.csv"
full_relational = expanded_dict_folder + "closest_relational_1000.csv"
full_demographic = expanded_dict_folder + "closest_demographic_1000.csv"
full_cultural_set = set()
full_relational_set = set()
full_demographic_set = set()

csv_lst = [full_cultural, full_demographic, full_relational]
set_lst = [full_cultural_set, full_demographic_set, full_relational_set]

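# Load each dictionary CSV into its matching set. The term is in the first
# column; underscores are replaced with spaces so multi-word terms are
# space-separated.
for csv_path, term_set in zip(csv_lst, set_lst):
    with open(csv_path, 'r') as f:
        reader = csv.reader(f)
        for line in reader:
            term_set.add(line[0].replace('_', ' '))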

        
all_terms = set.union(full_cultural_set, full_relational_set, full_demographic_set)
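
To make the loading step concrete, here is a minimal sketch of the same normalization applied to an in-memory stand-in for one dictionary file. The sample rows and the second column are assumptions made for illustration; the real files may hold different terms and columns.

```python
import csv
import io

# Made-up rows standing in for a file such as closest_culture_1000.csv.
# Only the first column (the term) is used; the second column is invented.
sample_csv = io.StringIO(
    "social_norm,0.91\n"
    "legitimacy,0.88\n"
    "institutional_logic,0.85\n"
)

terms = set()
for row in csv.reader(sample_csv):
    # Replace underscores with spaces so multi-word terms become
    # space-separated strings.
    terms.add(row[0].replace("_", " "))

print(sorted(terms))
```

Storing the terms in a set keeps the later membership checks fast, which matters when every line of every ngram file is tested.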

## Update Words

Here, we'll create lists of the authors we want to find within our citations, categorizing them as demographic, relational, or cultural. Edit the Python lists in the following code snippet by adding or removing authors to get new or updated counts for specific authors.

In [45]:
demographic_authors = ['hannan freeman', 'barnett carroll', 'barron west', 'brüderl schüssler', 'carrol hannan', 
                       'freeman carrol', 'fichman levinthal', 'carrol']
## Will leaving in single authors catch extra citations?

relational_authors = ['pfeffer salancik', 'burt christman', 'pfeffer nowak', 'pfeffer']

cultural_authors = ['meyer rowan', 'dimaggio powell', 'powell dimaggio', 'oliver', 'powell', 'scott', 'weick']

ALL_AUTHORS = set(demographic_authors + relational_authors + cultural_authors)

author_types_list = [cultural_authors, demographic_authors, relational_authors]
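
Because the counting function compares each ngram string directly against these entries, a two-word entry can only match a two-word key and a one-word entry only a one-word key. The sketch below groups a few entries by word count to show which ngram size could match them. It assumes, as the file naming suggests, that each `ngramN` file holds only N-word keys; the `by_length` name is hypothetical.

```python
# A few entries taken from the lists above, used here for illustration.
authors = ["hannan freeman", "pfeffer salancik", "pfeffer", "scott"]

# Group entries by number of words: 1 = unigram, 2 = bigram.
by_length = {1: [], 2: []}
for name in authors:
    n_tokens = len(name.split())
    by_length.setdefault(n_tokens, []).append(name)

print("unigram entries:", by_length[1])
print("bigram entries:", by_length[2])
```

Under the same assumption, a single-surname entry such as `'scott'` would be credited with every occurrence of that word in an article, whether or not it appears in a citation, so single-name entries may inflate the counts.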

## Word Counts

Using the article identifiers we read in from filtered_index.csv, we will now parse each article's ngram .txt file and collect and store the counts of the authors mentioned in the citations of the JSTOR articles. After filling this dataframe, we can take a look at the total counts for each category (i.e. cultural, demographic, relational).

First, we'll create our dataframe that will contain our final information, as well as a list containing the names of the various perspectives we use when collecting ngram counts.

In [46]:
counts_df = pd.DataFrame(columns=["article_id", "cultural_author_count", "demographic_author_count", "relational_author_count",
                                 "cultural_count2", "relational_count2", "demographic_count2",
                                 "cultural_count1", "relational_count1", "demographic_count1"]) 

perspective_types = ["cultural", "demographic", "relational"]

The following function generates ngram counts by parsing our JSTOR article files and collecting and storing the counts for the authors mentioned in each article. It takes an **ngram value** (1 = unigram, 2 = bigram, etc.) as well as the **dataframe to update** (in this case, the `counts_df` dataframe we created previously).

In [48]:
def generate_ngram_counts(ngram_value, counts_df):
    """Fill counts_df in place with author and term counts for one ngram size."""
    folder = os.path.join(JSTOR_HOME, 'ngram{}'.format(ngram_value))

    # Index rows by article ID once; later calls find the index already set.
    if 'article_id' in counts_df.columns:
        counts_df.set_index('article_id', inplace=True)

    for file in tqdm(files):
        d = {}
        with open(os.path.join(folder, '{}-ngram{}.txt'.format(file, ngram_value)), 'r') as f:
            # Each line is "<ngram>\t<count>"; keep only tracked authors and terms.
            for line in f.read().splitlines():
                k, v = line.split('\t')
                if k in ALL_AUTHORS or k in all_terms:
                    d[k] = int(v)

        author_sums = [sum(d.get(author, 0) for author in author_list) for author_list in author_types_list]
        term_sums = [sum(d.get(term, 0) for term in set_type) for set_type in set_lst]

        # Note: author counts are overwritten on every call, so they reflect
        # the most recent ngram size processed.
        for i in range(3):
            counts_df.loc[file, perspective_types[i] + "_author_count"] = author_sums[i]
            counts_df.loc[file, perspective_types[i] + "_count{}".format(ngram_value)] = term_sums[i]
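
The core of the counting step can be sketched on a tiny in-memory example. The ngram lines and counts below are invented for illustration, and the author lists are shortened; the tab-separated `ngram<TAB>count` layout is taken from the parsing code in this notebook.

```python
# Invented contents of one bigram file: "<ngram>\t<count>" per line.
ngram_text = (
    "hannan freeman\t3\n"
    "meyer rowan\t2\n"
    "resource dependence\t5\n"
    "of the\t120\n"
)

# Shortened author lists, keyed by perspective.
author_lists = {
    "cultural": ["meyer rowan", "dimaggio powell"],
    "demographic": ["hannan freeman"],
    "relational": ["pfeffer salancik"],
}
tracked = {a for lst in author_lists.values() for a in lst}

# Keep only the ngrams we track, as integers.
counts = {}
for line in ngram_text.splitlines():
    ngram, count = line.split("\t")
    if ngram in tracked:
        counts[ngram] = int(count)

# Sum the counts of each perspective's authors; missing authors count as 0.
author_sums = {
    perspective: sum(counts.get(a, 0) for a in lst)
    for perspective, lst in author_lists.items()
}
print(author_sums)  # {'cultural': 2, 'demographic': 3, 'relational': 0}
```

Filtering while reading keeps the per-article dictionary small, since most ngrams in a file are neither authors nor dictionary terms.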
        

### Bigrams

For this section, we will look specifically at **bigrams (ngram = 2)**. After running the function, we can view the first rows of the dataframe and then the column totals for each category (i.e. cultural, demographic, relational).

In [None]:
generate_ngram_counts(2, counts_df)
counts_df.head()

In [None]:
counts_df.sum()

### Unigrams

We will use the same function as in the previous section, this time for **unigrams (ngram = 1)**. Afterwards, we can again view the first rows of the dataframe and the column totals for each category (i.e. cultural, demographic, relational).

In [None]:
generate_ngram_counts(1, counts_df)
counts_df.head()

In [None]:
counts_df.sum()

## Wrap-up + storage

We can now write our dataframe to a CSV file to parse and use later. Uncomment the line below to write to the `citation_and_expanded_dict_count_may7.csv` file, or change the file name passed in to write to a different file.

In [None]:
# counts_df.to_csv('citation_and_expanded_dict_count_may7.csv', index=True)