# How Do Publication Count and Topic Affect a University's Ranking Among the Top 20 CS Schools?
### Authors: Mitchell Skopic and Kalyan Kanagala
## Table of Contents
1. Introduction to the Problem
2. Data Collection and Organization
3. Exploratory Data Analysis
4. Hypothesis Testing
5. Conclusion

## 1. Introduction to the Problem
### Imagine...
You're a bright, ambitious high school senior preparing to enter college. For years you've had a fascination with computers and the exciting research surrounding them. In short, you want to study and contribute to the field of computer science. But where should you study? Which universities have research you're interested in? Does research influence a computer science education? What should you do?
### The Problem
There are numerous resources for identifying top computer science programs, one of the most famous being U.S. News and World Report's rankings. What factors does U.S. News and World Report use when determining rankings? How is research factored into them? That is what we aim to find out.<br><br>
**How do research quantity and topic influence the rankings of computer science schools?**<br><br>
This problem is important for several reasons. One, choosing a university is an extremely important decision, one which requires as much information as possible. Two, one gets out of university what they put in. In other words, if a student seeks an academic track, they would prefer a university with an emphasis on research and publication. If they seek a professional track, these metrics matter less. And three, knowing a school's area of research (and how prevalent its research is in the field) is important when making a decision. A systems-research-oriented student would not search for a university with the same criteria as a data-science-oriented student looking to work in industry.
### Overview
To answer these questions and help students make more informed university decisions, we will analyze the rankings from csrankings.org along with its data on computer science research from the past ten years. There have been previous analyses of faculty number and rankings (https://krixly.github.io/); this project instead seeks correlations between rankings and the quantity and area of research.
#### Ulterior Motives?
This analysis also doubles as a tutorial in data science, particularly the data science pipeline. Data science seeks to analyze data to extract meaning. In our case, this meaning comes in the form of understanding how research affects CS rankings. The pipeline we use includes data collection and organization, exploratory data analysis, hypothesis testing and machine learning, and finally interpretation and explanation of the results. To aid in this understanding, we have included extra resources in the form of links and definitions.<br>For more information on data science in general, visit: https://en.wikipedia.org/wiki/Data_science<br>For more information on the data science pipeline, visit: https://towardsdatascience.com/a-beginners-guide-to-the-data-science-pipeline-a4904b2d8ad3

## 2. Data Collection and Organization
#### General Data Collection and Organization Info
The first step in the data science pipeline is data collection and organization. There are many ways to collect data, ranging from running your own scientific experiments or polling a sample to scraping data from a web service. Data scraping means using code to extract data from another program's output, such as a web page. A common practice in Python data science is to read the data into a dataframe, a table structure that makes data manipulation and analysis easy. Once the data is in a dataframe, certain issues can surface, such as missing or duplicated data. To combat this, we cleanse the data to ensure it has only valid values and won't be skewed by duplications. Finally, the last step of this section is data organization, in which you drop unwanted data and tidy what remains. In tidy data, each row represents a single observation (in our case, each row will be a university).<br>More info on tidy data (pdf download): https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf
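As a minimal sketch of the cleansing ideas above, the snippet below drops rows with missing values and exact duplicate rows from a toy table. The table and its values are made up for illustration and are not taken from our dataset.

```python
import pandas as pd

# Toy publication records; the values are illustrative assumptions.
raw = pd.DataFrame({
    "university": ["A", "A", "B", "B", None],
    "area": ["icml", "icml", "sosp", None, "stoc"],
    "year": [2016, 2016, 2014, 2015, 2017],
})

# Remove rows that have a missing value in any column.
no_missing = raw.dropna()

# Remove exact duplicate rows, keeping the first occurrence.
cleaned = no_missing.drop_duplicates().reset_index(drop=True)

print(cleaned)
```

Whether duplicates are errors depends on the data. Dropping them is only appropriate when identical rows really are repeated records rather than separate observations that happen to share the same values.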
#### Our Data Collection and Organization
We scraped data from csrankings.org (see link below). This is the raw data used to create this source's Top 20 CS schools rankings. It includes information on professors, their publications, their associated universities, and their areas of research. The data comes as csv files (Def: CSV - comma separated values), a human-readable format that stores each row of a data table on its own line and separates the entries in a row with commas. We read our data into a dataframe.
<br>Follow the steps below to see how we scraped, cleansed, and tidied our data.
<br><br>Data came from: https://raw.githubusercontent.com/emeryberger/CSrankings/gh-pages/csrankings.csv
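
To make the csv format concrete, here is a small sketch that parses csv text held in a string. The two people and their institutions are invented placeholders; only the column names mirror the kind of columns found in the csrankings files.

```python
import io
import pandas as pd

# A tiny csv document: the first line is the header, each later line is one row.
# The names and institutions are made-up placeholders.
csv_text = """name,dept,area,year
Ada Example,Example University,icml,2016
Bob Sample,Sample Institute,sosp,2014
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

print(df.columns.tolist())
print(len(df))
```

Because `read_csv` infers column types, the numeric `year` column is parsed as integers rather than strings.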

In [2]:
import requests
from bs4 import BeautifulSoup

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import IPython.display as display
%matplotlib inline

For the top 20 rankings data, we created a csv file from csrankings.org. Then we dropped two columns (group and faculty) because they would not be useful for us.

In [169]:
#read the top 20 rankings from csrankings.org (data was copied from the website to a csv)
top_20_universities_rank = pd.read_csv("top_20_cs_schools.csv")

#remove unnecessary columns
top_20_universities_rank = top_20_universities_rank.drop(columns=['group', 'faculty'])

#get the list of top 20 schools
top_20 = list(top_20_universities_rank['University'])
rank = list(top_20_universities_rank['Rank'])

top_20_universities_rank

Unnamed: 0,Rank,University
0,1,Carnegie Mellon University
1,2,Massachusetts Institute of Technology
2,3,Stanford University
3,4,University of California - Berkeley
4,5,Univ. of Illinois at Urbana-Champaign
5,6,University of Michigan
6,7,Cornell University
7,8,University of Washington
8,9,Georgia Institute of Technology
9,10,University of California - San Diego


In [3]:
url = "https://raw.githubusercontent.com/emeryberger/CSrankings/gh-pages/csrankings.csv"

In [15]:
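#read the faculty list (name, affiliation, homepage, scholar id)
#open question: this table may not be needed for the analysis below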
prof_names = pd.read_csv(url)
prof_names.head()

Unnamed: 0,name,affiliation,homepage,scholarid
0,A. Aldo Faisal,Imperial College London,https://www.imperial.ac.uk/people/a.faisal,WjHjbrwAAAAJ
1,A. Antony Franklin,IIT Hyderabad,http://www.iith.ac.in/~antony/index.html,LVfqLuoAAAAJ
2,A. C. Cem Say,Boğaziçi University,https://www.cmpe.boun.edu.tr/~say,rOum2XsAAAAJ
3,A. C. W. Finkelstein,University College London,http://www0.cs.ucl.ac.uk/staff/A.Finkelstein,n8xuCVkAAAAJ
4,A. Cüneyd Tantug,Istanbul Technical University,http://tantug.com,TTawdWMAAAAJ


Next, we read data from https://raw.githubusercontent.com/emeryberger/CSrankings/gh-pages/generated-author-info.csv . This step uses a pandas function, read_csv(), to read the data directly from a web page. This is the first point at which we cleanse the data. <br>The potential problem: This data could contain entries we don't need or could have invalid entries (for example, if an entry does not contain a value for publication area). <br>Solution: cleanse the data to ensure the data we're using will give us the most relevant and valid results. The first cleansing step is to drop columns that we do not need (such as professor name). We also rename columns (dept to university) to better reflect what the data means in the context of our analysis.

In [154]:
#read data from csv file into a pandas dataframe
professor_publication_info_csv = "https://raw.githubusercontent.com/emeryberger/CSrankings/gh-pages/generated-author-info.csv"
professor_publication_info = pd.read_csv(professor_publication_info_csv)

#drop the professor's name from the table as we don't need this information
professor_publication_info.drop('name',axis = 1,inplace = True)

#drop the count and adjustedcount fields. They are not relevant to our analysis
professor_publication_info.drop('count',axis = 1,inplace = True)
professor_publication_info.drop('adjustedcount',axis = 1,inplace = True)

#rename the column label from 'dept' to 'university' because it better represents how we understand the data
professor_publication_info.rename(columns={"dept":"university"}, inplace=True)

#rename the column label from 'area' to 'research area' because it better represents how we understand the data
professor_publication_info.rename(columns={"area":"research area"}, inplace=True)

professor_publication_info.head()

Unnamed: 0,university,research area,year
0,Imperial College London,icra,2016
1,Istanbul Technical University,acl,2007
2,VU Amsterdam,ijcai,2007
3,Bilkent University,ismb,2014
4,George Mason University,fse,2016


The next step of cleansing the data set is to remove entries we do not need: first publications from before 2008, then publications from universities that aren't ranked in the top 20. We show the significance of each step by printing the number of entries before and after it. Leaving all of that unwanted data in would skew any data analysis.

In [155]:
print(f'Number of total entries before cleansing by year: {len(professor_publication_info)}')

#read_csv parses the year column as integers, so the valid years must be integers too
valid_years = list(range(2008, 2019)) #2008 through 2018

#first step is to drop all the entries with data older than 2008
professor_publication_info = professor_publication_info[professor_publication_info['year'].isin(valid_years)]

print(f'Number of total entries after cleansing by year: {len(professor_publication_info)}')

Number of total entries before cleansing by year: 101597
Number of total entries after cleansing by year: 61801
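
An alternative way to express the year filter is a range check. The sketch below uses toy data and assumes the `year` column holds integers, which is how `read_csv` normally parses a purely numeric column.

```python
import pandas as pd

# Toy data standing in for the publication table.
pubs = pd.DataFrame({
    "university": ["A", "B", "C", "D"],
    "year": [2005, 2008, 2013, 2018],
})

# Keep rows whose year falls in 2008..2018; between() includes both ends by default.
recent = pubs[pubs["year"].between(2008, 2018)]

print(f"Before: {len(pubs)}, after: {len(recent)}")
```

A range check avoids listing every year by hand and sidesteps type mismatches that can occur when integer years are compared against strings.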


In [156]:
#this section of cleansing will remove any data from a university not in the top 20
print(f'Number of total entries before cleansing by uni: {len(professor_publication_info)}')

#dropping anything not related to top 20 universities
professor_publication_info = professor_publication_info[professor_publication_info['university'].isin(top_20)]

print(f'Number of total entries after cleansing by uni: {len(professor_publication_info)}')

Number of total entries before cleansing by uni: 61801
Number of total entries after cleansing by uni: 17593


As we can see, cleansing reduced the number of data entries from 101597 to 17593. We are now left with data relating only to publications at the top 20 universities from 2008 through 2018.<br><br>The next step is to ensure there is no invalid data. We do this by checking the unique values of each column. If there is something unexpected, such as a 'research area' entry that is not one of the areas listed on csrankings.org, we should remove that entry. We accomplish this by inspecting the unique entries and looking for anything invalid.

In [157]:
professor_publication_info['research area'].unique()
#by checking which research areas are used to calculate ranking, we found that:
#pets is not tracked
#chiconf not tracked

print(f'Number of total entries before cleansing untracked research areas: {len(professor_publication_info)}')
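#keep a copy of the data that still includes the untracked research areas
publications_including_untracked = professor_publication_info.copy(deep=True)
untracked = ['pets', 'chiconf']
#publications in the untracked research areas only
untracked_publications = publications_including_untracked[publications_including_untracked['research area'].isin(untracked)]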

#remove untracked research areas from data
professor_publication_info = professor_publication_info[~professor_publication_info['research area'].isin(untracked)]
print(f'Number of total entries after cleansing untracked research areas: {len(professor_publication_info)}')

Number of total entries before cleansing untracked research areas: 17593
Number of total entries after cleansing untracked research areas: 16826
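
The manual comparison of unique values against the tracked areas can also be automated with a set difference. This is a sketch on toy data: `tracked_areas` here is a small hypothetical subset, and the real list would have to be taken from csrankings.org.

```python
import pandas as pd

# Hypothetical subset of tracked areas; the real list comes from csrankings.org.
tracked_areas = {"icml", "sosp", "stoc"}

pubs = pd.DataFrame({
    "university": ["A", "A", "B", "B"],
    "research area": ["icml", "pets", "sosp", "chiconf"],
})

# Areas present in the data that are not in the tracked set.
unexpected = set(pubs["research area"].unique()) - tracked_areas
print(sorted(unexpected))

# Split the rows into tracked and untracked using a boolean mask.
is_tracked = pubs["research area"].isin(tracked_areas)
tracked_pubs = pubs[is_tracked]
untracked_pubs = pubs[~is_tracked]
```

Keeping both halves of the split preserves the untracked rows for later comparison instead of discarding them.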


When looking for invalid data, we found two research areas that are not among the areas considered in the rankings, so we removed those entries from the main publication data table. Before doing so, we copied the dataframe so that we keep a version with all publications, including the untracked ones.<br>We did this because it raises the question of why these research areas are not considered. Later, we will look at how much this missing data affects ranking.<br><br>The final step of data collection and organization is tidying the data! We do this by combining all the data into a new dataframe where each row corresponds to a university. We will also begin aggregating our data. This means summing things such as the total number of publications per school and the number of publications per school in each of the 4 categories (AI, Systems, Theory, and Interdisciplinary). This step is essential. Once the dataframe is formed, each column serves as a variable, which streamlines plotting and analysis.

In [172]:
#Tidying
#totals for finding each school's percentage of publication per area
num_ai = 0
num_systems = 0
num_theory = 0
num_inter = 0
num_untracked = 0
num_pubs_with_untracked = len(publications_including_untracked)

#columns of the new dataframe
total_pubs = np.zeros(20) #total pubs for each school
total_pubs_untracked = np.zeros(20) #total pubs for each school, including untracked pubs
total_ai = np.zeros(20) #total ai pubs for each school
total_theory = np.zeros(20) #total theory pubs for each school
total_systems = np.zeros(20) #total systems pubs for each school
total_inter = np.zeros(20) #total interdisciplinary pubs for each school
total_untracked = np.zeros(20) #untracked are treated as their own category

percentage_pubs = np.zeros(20) #percentage of total pubs
percentage_pubs_untracked = np.zeros(20) #percentage of total pubs, including untracked
percentage_ai = np.zeros(20)
percentage_theory = np.zeros(20)
percentage_systems = np.zeros(20)
percentage_inter = np.zeros(20)
percentage_untracked = np.zeros(20) #percentage of untracked pubs for each school

#list of each research area in the 4 categories:
ai = ['aaai','ijcai','cvpr','eccv','iccv','icml','kdd','nips','acl','emnlp','naacl','sigir','www']
systems = ['asplos','isca','micro','hpca','sigcomm','nsdi','ccs','oakland','usenixsec','ndss','sigmod',\
          'vldb','icde','pods','dac','iccad','emsoft','rtas','rtss','hpdc','ics','sc','mobicom','mobisys',\
          'sensys','imc','sigmetrics','osdi','sosp','eurosys','fast','usenixatc','pldi','popl','icfp','oopsla',\
          'fse','icse','ase','issta']
theory = ['focs','soda','stoc','crypto','eurocrypt','cav','lics']
inter = ['ismb','recomb','siggraph','siggraph-asia','ec','wine','ubicomp','uist','icra','iros',\
         'rss','vis','vr']

#first divide dataframe by categories
ai_df = professor_publication_info[professor_publication_info['research area'].isin(ai)]
systems_df = professor_publication_info[professor_publication_info['research area'].isin(systems)]
theory_df = professor_publication_info[professor_publication_info['research area'].isin(theory)]
inter_df = professor_publication_info[professor_publication_info['research area'].isin(inter)]

#totals of each category
num_ai = len(ai_df)
num_systems = len(systems_df)
num_theory = len(theory_df)
num_inter = len(inter_df)
num_untracked = len(untracked_publications)

#traverse each one to get num of each category
for j in range(0, len(top_20)) :
    total_ai[j] = len(ai_df[ai_df['university'].isin([top_20[j]])])
    total_systems[j] = len(systems_df[systems_df['university'].isin([top_20[j]])])
    total_theory[j] = len(theory_df[theory_df['university'].isin([top_20[j]])])
    total_inter[j] = len(inter_df[inter_df['university'].isin([top_20[j]])])
    total_untracked[j] = len(untracked_publications[untracked_publications['university'].isin([top_20[j]])])

#traverse each one to get total percentage
for j in range(0, len(top_20)) :
    percentage_ai[j] = total_ai[j] / num_ai
    percentage_systems[j] = total_systems[j] / num_systems
    percentage_theory[j] = total_theory[j] / num_theory
    percentage_inter[j] = total_inter[j] / num_inter
    percentage_untracked[j] = total_untracked[j] / num_untracked
    
    #total of all pubs
    total_pubs[j] = total_ai[j] + total_systems[j] + total_theory[j] + total_inter[j]
    total_pubs_untracked[j] = total_ai[j] + total_systems[j] + total_theory[j] + total_inter[j] + total_untracked[j]
    percentage_pubs[j] = total_pubs[j] / len(professor_publication_info)
    percentage_pubs_untracked[j] = total_pubs_untracked[j] / num_pubs_with_untracked
    
#make new Tidy dataframe
d = {'University' : top_20, 'Rank' : rank, 'Total Pubs' : total_pubs, '% Total Pubs' : percentage_pubs,\
    '# AI Pubs' : total_ai, '% AI Pubs' : percentage_ai, '# Systems Pubs' : total_systems, '% Systems Pubs' :\
     percentage_systems, '# Theory Pubs' : total_theory, '% Theory Pubs' : percentage_theory, '# Inter Pubs' : total_inter,\
    '% Inter Pubs' : percentage_inter, 'Total Pubs w/ Untracked' : total_pubs_untracked, '% Total w/ Untracked' : \
    percentage_pubs_untracked, '# Untracked Pubs' : total_untracked, '% Untracked Pubs' : percentage_untracked}
top_20_publication_data = pd.DataFrame(data = d)
top_20_publication_data 



Unnamed: 0,University,Rank,Total Pubs,% Total Pubs,# AI Pubs,% AI Pubs,# Systems Pubs,% Systems Pubs,# Theory Pubs,% Theory Pubs,# Inter Pubs,% Inter Pubs,Total Pubs w/ Untracked,% Total w/ Untracked,# Untracked Pubs,% Untracked Pubs
0,Carnegie Mellon University,1,1885.0,0.112029,762.0,0.14997,547.0,0.075741,179.0,0.097336,397.0,0.147914,2053.0,0.116694,168.0,0.219035
1,Massachusetts Institute of Technology,2,1200.0,0.071318,310.0,0.061012,478.0,0.066187,178.0,0.096792,234.0,0.087183,1224.0,0.069573,24.0,0.031291
2,Stanford University,3,1110.0,0.065969,358.0,0.070459,399.0,0.055248,133.0,0.072322,220.0,0.081967,1157.0,0.065765,47.0,0.061278
3,University of California - Berkeley,4,1127.0,0.06698,326.0,0.064161,474.0,0.065633,112.0,0.060903,215.0,0.080104,1161.0,0.065992,34.0,0.044329
4,Univ. of Illinois at Urbana-Champaign,5,1050.0,0.062403,232.0,0.04566,630.0,0.087233,99.0,0.053834,89.0,0.033159,1095.0,0.062241,45.0,0.05867
5,University of Michigan,6,1016.0,0.060383,250.0,0.049203,590.0,0.081695,50.0,0.027189,126.0,0.046945,1074.0,0.061047,58.0,0.075619
6,Cornell University,7,985.0,0.05854,405.0,0.079709,325.0,0.045001,93.0,0.050571,162.0,0.060358,1050.0,0.059683,65.0,0.084746
7,University of Washington,8,939.0,0.055806,182.0,0.03582,451.0,0.062448,106.0,0.05764,200.0,0.074516,1062.0,0.060365,123.0,0.160365
8,Georgia Institute of Technology,9,964.0,0.057292,332.0,0.065341,370.0,0.051232,67.0,0.036433,195.0,0.072653,1035.0,0.05883,71.0,0.092568
9,University of California - San Diego,10,704.0,0.04184,147.0,0.028931,388.0,0.053725,88.0,0.047852,81.0,0.030179,740.0,0.042062,36.0,0.046936


#### Final Tidy Table
We have a final tidy table where each row is a university. This is the first time we have all of the data from the 16826 cleansed table entries compiled in one place. Now that the data is scraped, cleansed, and tidied, we're ready to move on to the next phase of the data science pipeline: Exploratory Data Analysis
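
For readers who want a more compact route to the same kind of per-school tally, pandas can count publications per university and category in one call. This is a sketch on toy data, not the code we ran; `category_of` is a small assumed subset of the venue-to-category mapping.

```python
import pandas as pd

# Small assumed subset of the venue-to-category mapping.
category_of = {"icml": "AI", "cvpr": "AI", "sosp": "Systems", "stoc": "Theory"}

# Toy publication records, one row per publication.
pubs = pd.DataFrame({
    "university": ["A", "A", "A", "B", "B"],
    "research area": ["icml", "cvpr", "sosp", "sosp", "stoc"],
})

# Map each venue to its broad category.
pubs["category"] = pubs["research area"].map(category_of)

# One row per university, one column per category; cells hold publication counts.
counts = pd.crosstab(pubs["university"], pubs["category"])

# Each school's share of all publications within a category (each column sums to 1).
shares = counts / counts.sum(axis=0)

# Total publications per school across the categories.
counts["Total"] = counts.sum(axis=1)

print(counts)
print(shares)
```

Deriving the shares from a single counts table keeps each percentage tied to its own category total, which reduces the chance of pairing a count with the wrong denominator.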

In [87]:
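#read undergraduate CS research data and count publications per top 20 university
#open question: this cell may not be needed for the analysis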
undergraduate_number_cs = pd.read_csv("undergraduate_cs_research.csv")
undergraduate_number_cs.rename(columns={"dept":"university"}, inplace=True)
#
top20_US_Universities = \
professor_publication_info[professor_publication_info['university'].isin(top_20)]
final_dataframe = top20_US_Universities.groupby('university').university.count()
#final_dataframe

## 3. Exploratory Data Analysis (EDA)
#### General EDA Info
This step of the data science pipeline is for getting a sense of the data and what possible meaning could be extracted from it. This can be done through a variety of methods, most commonly by visualizing the data in charts and graphs.<br>The NIST Engineering Statistics Handbook describes EDA as "an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to:
1. maximize insight into a data set;
2. uncover underlying structure;
3. extract important variables;
4. detect outliers and anomalies;
5. test underlying assumptions;
6. develop parsimonious models; and
7. determine optimal factor settings."

For a deeper look at more EDA methods and concepts, this NIST handbook is a useful resource (https://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm)<br>
#### Our EDA
In this step, we will not be answering the question of CS rankings and research, just representing important aspects of the data which we may or may not use for our analysis.<br>Most of our plots draw on the tidy dataframe built in Section 2, in which each row is one of the top 20 schools. The original publication data, which has one publication per row, is still useful for views such as publications per year.

In [None]:
#plots to make: bar plot of all publications for each school
    #add up all publications by school
#plot of all publications including untracked
    #same totals, with untracked publications included
#plot of publications by year (one line per school)
#plot of types of publications by school
#plot of diverse publications (by larger category, not all 73 types)
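
As a sketch of the first planned plot, the snippet below draws a bar chart of total publications per school. It uses only the first three rows of the tidy table shown in Section 2; the name `tidy` and the styling choices are our own assumptions, and the full analysis would pass in all 20 schools.

```python
import matplotlib.pyplot as plt
import pandas as pd

# First three rows of the tidy table from Section 2, copied by hand.
tidy = pd.DataFrame({
    "University": [
        "Carnegie Mellon University",
        "Massachusetts Institute of Technology",
        "Stanford University",
    ],
    "Rank": [1, 2, 3],
    "Total Pubs": [1885, 1200, 1110],
})

# One bar per school, ordered by rank.
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(tidy["University"], tidy["Total Pubs"])
ax.set_xlabel("University (ordered by rank)")
ax.set_ylabel("Total publications")
ax.set_title("Total publications per school")
ax.tick_params(axis="x", labelrotation=45)
fig.tight_layout()
# In a notebook the figure renders inline; in a script, call plt.show().
```

Ordering the bars by rank makes it easy to see at a glance whether publication totals fall as rank increases.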

## 5. Conclusion
Two research areas appear in the dataset but are untracked: Human-Computer Interaction (chiconf) and 'pets', which we could not identify. We determined this because csrankings.org lists 73 research areas used to calculate its rankings, and these two areas are in the dataset but are not factored in for some reason.<br><br>Another important point is that this ranking is entirely quantitative, while U.S. News and World Report uses polling data from department heads (who are likely biased towards their own universities).<br><br>
<br><br>So...<br>Now you're well versed in both how research influences university rankings and the data science pipeline. With this tutorial you could perform your own analyses of how publication affects other ranking systems (U.S. News and World Report?) or of any data-driven question. Curious about how poverty in America compares with the rest of the world? The effect of temperature on NFL football games? Use these tools to master data and extract meaning from the world around you!<br><br>For a more comprehensive look at data science and machine learning, there are many online classes. Here are a few highly regarded classes:<br>https://www.class-central.com/course/udacity-intro-to-data-analysis-4937<br>https://www.udemy.com/machinelearning/<br>https://www.udemy.com/data-science-and-machine-learning-with-python-hands-on/