# How Do Publication Count and Topic Affect a University's Ranking Among the Top 20 CS Schools?
### Authors: Mitchell Skopic and Kalyan Kanagala
## Table of Contents
1. Introduction to the Problem
2. Data Collection and Organization
3. Exploratory Data Analysis
4. Hypothesis Testing
5. Conclusion

## 1. Introduction to the Problem
### Imagine...
You're a bright, ambitious high school senior preparing to enter college. For years you've had a fascination with computers and the exciting research surrounding them. In short, you want to study and contribute to the field of computer science. But where should you study? Which universities have research you're interested in? Does research influence computer science education? What should you do?
### The Problem
There are numerous resources for identifying top computer science programs, one of the most famous being U.S. News and World Report's rankings. What factors does U.S. News and World Report use when determining rankings? How is research factored in? That is what we aim to find out.<br><br>
**How do research quantity and topic influence the rankings of computer science schools?**<br><br>
This problem is important for several reasons. First, choosing a university is an extremely important decision, one that calls for as much information as possible. Second, students get out of university what they put in: a student seeking an academic track would prefer a university that emphasizes research and publication, while for a student seeking a professional track these metrics matter less. Third, knowing a school's areas of research (and how prevalent that research is in the field) matters when making a decision. A student oriented toward systems research would not search for a university with the same criteria as a data science student looking to work in industry.
### Overview
To answer these questions and help students make more informed university decisions, we analyze data from csrankings.org: both its rankings and its data on computer science research from the past ten years. Previous analyses have examined faculty count and rankings (https://krixly.github.io/); this project instead seeks correlations between rankings and research quantity and area.
#### Ulterior Motives?
This analysis also doubles as a tutorial in data science, particularly the data science pipeline. Data science seeks to analyze data to extract meaning. In our case, this meaning comes in the form of understanding how research affects CS rankings. The pipeline we use includes: Data collection and organization, exploratory data analysis, hypothesis testing and machine learning, and finally interpretation and meaning explanation. To aid in this understanding, we have included extra resources in the form of links and definitions.<br>For more information on data science in general, visit: https://en.wikipedia.org/wiki/Data_science<br>For more information on the data science pipeline, visit: https://towardsdatascience.com/a-beginners-guide-to-the-data-science-pipeline-a4904b2d8ad3

## 2. Data Collection and Organization
The first step in the data science pipeline is data collection and organization. We scraped data from csrankings.org (see link below). (Def: data scraping is using code to extract data from another program's output.) This is the raw data used to create this source's Top 20 CS schools rankings. It includes information on professors, their publications, their associated universities, and their areas of research. An important property of scraped data is how easy it is to read and work with. The data here comes as CSV files (Def: CSV, comma-separated values), which store each row of a data table on its own line and separate the entries in a row with commas. We read the data into a Python data structure called a DataFrame (from the pandas library) to work with it more easily.
<br>Follow the steps below to see how we scraped, cleansed, and organized our data.
<br><br>Data came from: https://raw.githubusercontent.com/emeryberger/CSrankings/gh-pages/csrankings.csv
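As a minimal sketch of how a CSV becomes a DataFrame, the snippet below parses a tiny in-memory sample instead of the real file. The names and affiliations in it are hypothetical placeholders, not rows from csrankings.org.

```python
import io

import pandas as pd

# Hypothetical three-row sample laid out like a CSV file:
# the first line holds the column labels, each later line is one row.
sample_csv = io.StringIO(
    "name,affiliation\n"
    "Ada Example,Example University\n"
    "Bob Sample,Sample Institute\n"
    "Cy Demo,Example University\n"
)

# read_csv accepts a file path, a URL, or a file-like object such as this one.
sample_df = pd.read_csv(sample_csv)

print(sample_df.shape)          # (number of rows, number of columns)
print(list(sample_df.columns))  # labels taken from the first line
```

Passing a URL string to `pd.read_csv` works the same way, which is how the cells below load the real data.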

In [2]:
import requests
from bs4 import BeautifulSoup

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import IPython.display as display
%matplotlib inline

In [85]:
#read the top 20 rankings from csrankings.org (data was copied from the website to a csv)
top_20_universities_rank = pd.read_csv("top_20_cs_schools.csv")

#remove unnecessary columns
top_20_universities_rank.drop('group',axis = 1,inplace = True)
top_20_universities_rank.drop('faculty',axis = 1,inplace = True)

#get the list of top 20 schools
top_20 = list(top_20_universities_rank['University'])

top_20_universities_rank

| | Rank | University |
|---|---|---|
| 0 | 1 | Carnegie Mellon University |
| 1 | 2 | Massachusetts Institute of Technology |
| 2 | 3 | Stanford University |
| 3 | 4 | University of California - Berkeley |
| 4 | 5 | Univ. of Illinois at Urbana-Champaign |
| 5 | 6 | University of Michigan |
| 6 | 7 | Cornell University |
| 7 | 8 | University of Washington |
| 8 | 9 | Georgia Institute of Technology |
| 9 | 10 | University of California - San Diego |


In [3]:
url = "https://raw.githubusercontent.com/emeryberger/CSrankings/gh-pages/csrankings.csv"

In [15]:
prof_names = pd.read_csv(url)
prof_names.head()

| | name | affiliation | homepage | scholarid |
|---|---|---|---|---|
| 0 | A. Aldo Faisal | Imperial College London | https://www.imperial.ac.uk/people/a.faisal | WjHjbrwAAAAJ |
| 1 | A. Antony Franklin | IIT Hyderabad | http://www.iith.ac.in/~antony/index.html | LVfqLuoAAAAJ |
| 2 | A. C. Cem Say | Boğaziçi University | https://www.cmpe.boun.edu.tr/~say | rOum2XsAAAAJ |
| 3 | A. C. W. Finkelstein | University College London | http://www0.cs.ucl.ac.uk/staff/A.Finkelstein | n8xuCVkAAAAJ |
| 4 | A. Cüneyd Tantug | Istanbul Technical University | http://tantug.com | TTawdWMAAAAJ |


Next, we read data from https://raw.githubusercontent.com/emeryberger/CSrankings/gh-pages/generated-author-info.csv <br>Potential problem: This data could contain entries we don't need or could have invalid entries (for example, if an entry does not contain a value for publication area). <br>Solution: cleanse the data. Cleansing data is a way to remove any invalid or unwanted data. It is also a way to ensure the data we're using will give us the most relevant and valid results.
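One common way to find and drop entries with missing values is sketched below on toy data. The rows are hypothetical, and `dropna` is just one possible approach; it is not necessarily the check applied to the real data set.

```python
import pandas as pd

# Hypothetical rows; None marks a missing publication area.
raw = pd.DataFrame({
    "dept": ["Example University", "Sample Institute", "Example University"],
    "area": ["icra", None, "acl"],
    "year": [2016, 2014, 2007],
})

# Count the missing values in each column before deciding what to drop.
missing_per_column = raw.isna().sum()
print(missing_per_column)

# Keep only the rows that have a value in the 'area' column.
cleaned = raw.dropna(subset=["area"])
print(f"{len(raw)} rows before, {len(cleaned)} rows after")
```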

In [108]:
#read data from csv file into a pandas dataframe
professor_publication_info_csv = "https://raw.githubusercontent.com/emeryberger/CSrankings/gh-pages/generated-author-info.csv"
professor_publication_info = pd.read_csv(professor_publication_info_csv)

#drop the professor's name from the table as we don't need this information
professor_publication_info.drop('name',axis = 1,inplace = True)

#drop the count and adjustedcount fields. They are not relevant to our analysis
professor_publication_info.drop('count',axis = 1,inplace = True)
professor_publication_info.drop('adjustedcount',axis = 1,inplace = True)

#rename the column label from 'dept' to 'university' because it better represents how we understand the data
professor_publication_info.rename(columns={"dept":"university"}, inplace=True)

#rename the column label from 'area' to 'research area' because it better represents how we understand the data
professor_publication_info.rename(columns={"area":"research area"}, inplace=True)

professor_publication_info.head()

| | university | research area | year |
|---|---|---|---|
| 0 | Imperial College London | icra | 2016 |
| 1 | Istanbul Technical University | acl | 2007 |
| 2 | VU Amsterdam | ijcai | 2007 |
| 3 | Bilkent University | ismb | 2014 |
| 4 | George Mason University | fse | 2016 |


The next step of cleansing is to remove all entries we do not need: first publications from outside 2008 to 2018, then publications from universities not ranked in the top 20. We show the significance of each step by printing the number of entries before and after it. Leaving that unwanted data in would skew any analysis of the top 20 schools.
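This kind of row filtering can be sketched with boolean masks on toy data. The universities and years below are hypothetical, and `between` is used here as an alternative to listing every year explicitly.

```python
import pandas as pd

# Hypothetical publication records.
toy = pd.DataFrame({
    "university": ["Example University", "Sample Institute",
                   "Example University", "Other College"],
    "year": [2005, 2012, 2016, 2017],
})
wanted_universities = ["Example University", "Sample Institute"]

# Each mask is a True/False column with one value per row.
in_window = toy["year"].between(2008, 2018)  # inclusive on both ends
in_list = toy["university"].isin(wanted_universities)

# Combine the masks with & and index the DataFrame with the result.
filtered = toy[in_window & in_list]
print(f"{len(toy)} rows before, {len(filtered)} rows after")
```

Printing the row count before and after each filter, as the cells below do, is a cheap sanity check that the filter removed roughly what was expected.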

In [109]:
print(f'Number of total entries before cleansing by year: {len(professor_publication_info)}')

#assuming read_csv parsed 'year' as integers (its default for a numeric column),
#compare against integers rather than strings
valid_years = list(range(2008, 2019))

#first step is to keep only the entries from 2008 through 2018
professor_publication_info = professor_publication_info[professor_publication_info['year'].isin(valid_years)]

print(f'Number of total entries after cleansing by year: {len(professor_publication_info)}')

Number of total entries before cleansing by year: 101597
Number of total entries after cleansing by year: 61801


In [110]:
#this section of cleansing will remove any data from a university not in the top 20
print(f'Number of total entries before cleansing by uni: {len(professor_publication_info)}')

#dropping anything not related to top 20 universities
professor_publication_info = professor_publication_info[professor_publication_info['university'].isin(top_20)]

print(f'Number of total entries after cleansing by uni: {len(professor_publication_info)}')

Number of total entries before cleansing by uni: 61801
Number of total entries after cleansing by uni: 17593


As we can see, cleansing was extremely important: it reduced the number of data entries from 101597 to 17593. Now we're left with data relating only to publications at the top 20 universities from 2008 through 2018.<br><br>The next step is to ensure there is no missing or invalid data. We do this by checking the unique values of each column. If something unexpected appears, such as a 'research area' entry that is not one of the areas tracked on csrankings.org, we should remove that entry. We accomplish this by inspecting the output of unique entries and looking for anything invalid.
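The check can also be sketched programmatically by comparing the observed values against a set of allowed ones. The `tracked_areas` set below is a hypothetical, shortened stand-in; the real list of tracked areas has to be taken from csrankings.org.

```python
import pandas as pd

# Hypothetical publication records.
toy = pd.DataFrame({
    "research area": ["icra", "acl", "pets", "acl", "chiconf"],
})

# Hypothetical whitelist of areas that count toward the ranking.
tracked_areas = {"icra", "acl", "ijcai"}

# Distinct values that actually occur in the column.
observed = set(toy["research area"].unique())

# Anything observed but not tracked deserves a closer look.
unexpected = sorted(observed - tracked_areas)
print(unexpected)
```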

In [111]:
#list the distinct research areas so we can compare them with the areas used in the ranking
professor_publication_info['research area'].unique()
#by checking which research areas are used to calculate ranking, we found that:
#pets is not tracked
#chiconf is not tracked
print(f'Number of total entries before cleansing untracked research areas: {len(professor_publication_info)}')
#keep a copy that still includes the untracked areas for later comparison
publications_including_untracked = professor_publication_info.copy(deep=True)
untracked = ['pets', 'chiconf']
#keep only the rows whose research area is tracked
professor_publication_info = professor_publication_info[~professor_publication_info['research area'].isin(untracked)]
print(f'Number of total entries after cleansing untracked research areas: {len(professor_publication_info)}')

Number of total entries before cleansing untracked research areas: 17593
Number of total entries after cleansing untracked research areas: 16826


When looking for invalid data, we found two research areas that were not among the areas considered in the rankings, so we removed those entries from the main publication data table. Before doing so, we copied the dataframe so that we keep a version with all publications, including the untracked ones.<br>We did this because it raises the question of why these research areas are not considered. Later, we will look at how much this missing data affects ranking.<br><br>
Now that the data is cleansed and organized, we're ready to move on to the next phase of the data science pipeline: Exploratory Data Analysis.
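As a rough sketch of how the copy with untracked areas could be used later, the snippet below computes the share of each university's publications that fall in untracked areas. The data is a toy example with hypothetical universities, and this is only one possible way to measure the effect.

```python
import pandas as pd

# Hypothetical publication records, including untracked areas.
all_pubs = pd.DataFrame({
    "university": ["Example University"] * 4 + ["Sample Institute"] * 2,
    "research area": ["icra", "pets", "acl", "chiconf", "acl", "icra"],
})
untracked = ["pets", "chiconf"]

# True for every publication in an untracked area.
is_untracked = all_pubs["research area"].isin(untracked)

# The mean of a True/False column is the fraction of True values,
# so grouping by university gives the untracked share per school.
untracked_share = is_untracked.groupby(all_pubs["university"]).mean()
print(untracked_share)
```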

In [87]:
#read the data in undergraduate_cs_research.csv and rename 'dept' to 'university' to match our other tables
undergraduate_number_cs = pd.read_csv("undergraduate_cs_research.csv")
undergraduate_number_cs.rename(columns={"dept":"university"}, inplace=True)
#count the number of publication entries for each top 20 university
top20_US_Universities = \
professor_publication_info[professor_publication_info['university'].isin(top_20)]
final_dataframe = top20_US_Universities.groupby('university').university.count()
#final_dataframe
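A per-university count like this can be lined up with the ranking table by merging on the university name. The sketch below uses toy data with hypothetical universities and ranks, and the column name `publications` is our own choice.

```python
import pandas as pd

# Hypothetical publication records and ranking table.
pubs = pd.DataFrame({
    "university": ["Example University", "Sample Institute", "Example University"],
})
ranks = pd.DataFrame({
    "Rank": [1, 2],
    "University": ["Example University", "Sample Institute"],
})

# Count the rows per university and turn the result into a two-column table.
pub_counts = (
    pubs.groupby("university")
    .size()
    .rename("publications")
    .reset_index()
)

# A left merge keeps every ranked school, in ranking order.
ranked_counts = ranks.merge(
    pub_counts, left_on="University", right_on="university", how="left"
)
print(ranked_counts[["Rank", "University", "publications"]])
```

The merge only matches rows whose names are spelled identically in both tables, so it is worth checking for schools that end up with a missing count.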

## 3. Exploratory Data Analysis (EDA)
This step of the data science pipeline is for getting a sense of the data and what meaning could be extracted from it. This is done through a variety of methods, most commonly by visualizing the data in charts and graphs.<br>The NIST Engineering Statistics Handbook describes EDA as "an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to:
1. maximize insight into a data set;
2. uncover underlying structure;
3. extract important variables;
4. detect outliers and anomalies;
5. test underlying assumptions;
6. develop parsimonious models; and
7. determine optimal factor settings."

For a deeper look at more EDA methods and concepts, this NIST handbook is a useful resource (https://www.itl.nist.gov/div898/handbook/eda/section1/eda11.htm)<br><br>
In this step, we will not be answering the question of CS rankings and research, just representing important aspects of the data which we may or may not use for our analysis.<br><br>
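As an illustration of the graphical side of EDA, here is a minimal bar chart sketch. The universities and counts are made-up toy numbers, not results from the data set.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical publication counts per university (toy numbers).
counts = pd.Series(
    {"Example University": 120, "Sample Institute": 95, "Other College": 60},
    name="publications",
)

# One bar per university, with height equal to its publication count.
fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(counts.index, counts.values)
ax.set_xlabel("University")
ax.set_ylabel("Publications (toy numbers)")
ax.set_title("Publication count per university")
fig.tight_layout()
```

In a notebook with `%matplotlib inline`, the figure is displayed below the cell once it finishes running.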

## 5. Conclusion
<br><br>So...<br>Now you're well versed in both how research influences university rankings and the data science pipeline. With this tutorial you could perform your own analyses on how publications affect other ranking systems (U.S. News and World Report?) or on any data-driven question. Curious about how poverty in America compares with the rest of the world? The effect of temperature on NFL football games? Use these tools to master data and extract meaning from the world around you!<br><br>For a more comprehensive look at data science and machine learning, there are many online classes. Here are a few highly regarded classes:<br>https://www.class-central.com/course/udacity-intro-to-data-analysis-4937<br>https://www.udemy.com/machinelearning/<br>https://www.udemy.com/data-science-and-machine-learning-with-python-hands-on/