# How Do Publication Count and Topic Affect a University's Ranking Among the Top 20 CS Schools?
### Authors: Mitchell Skopic and Kalyan Kanagala
## Table of Contents
1. Introduction to the Problem
2. Data Collection and Organization
3. Exploratory Data Analysis
4. Hypothesis Testing
5. Conclusion

## 1. Introduction to the Problem
### Imagine...
You're a bright, ambitious high school senior preparing to enter college. For years you've had a fascination with computers and the exciting research surrounding them. In short, you want to study and contribute to the field of computer science. But where should you study? Which universities conduct research you're interested in? Does research influence a computer science education? What should you do?
### The Problem
There are numerous resources for identifying top computer science programs, one of the most famous being the U.S. News and World Report rankings. What factors does U.S. News and World Report use when determining rankings? How is research factored in? That is what we aim to find out.<br><br>
**How do research quantity and topic influence the rankings of computer science schools?**<br><br>
This problem is important for several reasons. First, choosing a university is an extremely important decision, one which requires as much information as possible. Second, what a student gets out of a university depends on what they want to put in. If a student seeks an academic track, they would prefer a university with an emphasis on research and publication. If they seek a professional track, these metrics matter less. Third, knowing a school's areas of research (and how prominent that research is in the field) is important when making a decision. A student interested in systems research would not search for universities with the same criteria as a data-science-oriented student looking to work in industry.
### Overview
To answer these questions and help students make more informed university decisions, we will analyze data from csrankings.org: both its rankings and its data on computer science research from the past ten years. There have been previous analyses of faculty number and rankings (https://krixly.github.io/); however, this project seeks to find how research quantity and area correlate with rankings.
#### Ulterior Motives?
This analysis also doubles as a tutorial in data science, particularly the data science pipeline. Data science seeks to analyze data to extract meaning. In our case, this meaning comes in the form of understanding how research affects CS rankings. The pipeline we use includes data collection and organization, exploratory data analysis, hypothesis testing and machine learning, and finally interpretation and explanation of meaning. To aid in this understanding, we have included extra resources in the form of links and definitions.<br>For more information on data science in general, visit: https://en.wikipedia.org/wiki/Data_science<br>For more information on the data science pipeline, visit: https://towardsdatascience.com/a-beginners-guide-to-the-data-science-pipeline-a4904b2d8ad3

## 2. Data Collection and Organization
The first step in the data science pipeline is data collection and organization. We scraped (Def: data scraping- using code to extract data from another program) data from csrankings.org (see link below). This is the raw data used to create this source's Top 20 CS school rankings. It includes information on professors, their publications, their associated universities, and their areas of research. An important consideration when scraping is keeping the data human-readable. A CSV file (Def: CSV- comma-separated values) does this by storing each row of a data table on its own line, with the entries separated by commas. We read our data into a Python data structure called a DataFrame (from the pandas library) in order to work with the data more easily.
<br>Follow the steps below to see how we scraped, cleansed, and organized our data.
<br><br>Data came from: https://raw.githubusercontent.com/emeryberger/CSrankings/gh-pages/csrankings.csv
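To make the CSV-to-DataFrame idea concrete before loading the real files, here is a minimal sketch. It parses a two-row CSV held in a string; the two rows are copied from the preview of `csrankings.csv` shown further down, and the variable names `csv_text` and `sample` are ours, used only for illustration.

```python
import io

import pandas as pd

# A tiny CSV: the first line is the header, each later line is one table row,
# and commas separate the entries within a row.
csv_text = """name,affiliation
A. Aldo Faisal,Imperial College London
A. Antony Franklin,IIT Hyderabad
"""

# pd.read_csv accepts a file path, a URL, or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)          # (2, 2)
print(list(sample.columns))  # ['name', 'affiliation']
```

Passing a URL instead of a `StringIO` object works the same way, which is how the cells below load the real data.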

In [2]:
import requests
from bs4 import BeautifulSoup

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import IPython.display as display
%matplotlib inline

In [3]:
url = "https://raw.githubusercontent.com/emeryberger/CSrankings/gh-pages/csrankings.csv"

In [15]:
prof_names = pd.read_csv(url)
prof_names.head()

Unnamed: 0,name,affiliation,homepage,scholarid
0,A. Aldo Faisal,Imperial College London,https://www.imperial.ac.uk/people/a.faisal,WjHjbrwAAAAJ
1,A. Antony Franklin,IIT Hyderabad,http://www.iith.ac.in/~antony/index.html,LVfqLuoAAAAJ
2,A. C. Cem Say,Boğaziçi University,https://www.cmpe.boun.edu.tr/~say,rOum2XsAAAAJ
3,A. C. W. Finkelstein,University College London,http://www0.cs.ucl.ac.uk/staff/A.Finkelstein,n8xuCVkAAAAJ
4,A. Cüneyd Tantug,Istanbul Technical University,http://tantug.com,TTawdWMAAAAJ


Here, we read data from https://raw.githubusercontent.com/emeryberger/CSrankings/gh-pages/generated-author-info.csv <br>Potential problem: This data could contain entries we don't need or invalid entries (for example, an entry with no value for publication area). Solution: cleanse the data. Cleansing removes invalid data and helps ensure that the data we use will give us the most relevant and valid results.
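As an illustration of what cleansing can look like, the sketch below drops rows that are missing a school or a publication area. It is not necessarily the exact cleansing applied in this project: the helper `cleanse`, the toy table, and its made-up names and area labels are all hypothetical, and we assume the real file uses `dept` and `area` as column names.

```python
import pandas as pd


def cleanse(publications, required=("dept", "area")):
    """Return a copy with rows missing any required column removed."""
    cleaned = publications.dropna(subset=list(required))
    # Renumber the rows so the index has no gaps after dropping.
    return cleaned.reset_index(drop=True)


# Toy data (made up for this example): two of the three rows are invalid.
toy = pd.DataFrame({
    "name": ["Prof A", "Prof B", "Prof C"],
    "dept": ["Yale University", "Purdue University", None],
    "area": ["area1", None, "area2"],
})

cleaned = cleanse(toy)
print(len(toy), "->", len(cleaned))  # 3 -> 1
```

Returning a new DataFrame instead of modifying the input in place keeps the raw data available in case the cleansing rules need to change later.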

In [19]:
professor_publication_info_csv = "https://raw.githubusercontent.com/emeryberger/CSrankings/gh-pages/generated-author-info.csv"
professor_publication_info = pd.read_csv(professor_publication_info_csv)
# We keep the 'name' and 'area' columns; 'area' will be important later.
#professor_publication_info.drop('name',axis = 1,inplace = True)
#professor_publication_info.drop('area',axis = 1, inplace = True)
professor_publication_info.head()

# Number of distinct professors vs. total number of rows.
# Duplicate professor names are fine for our purposes.
print(len(professor_publication_info['name'].unique()))
print(len(professor_publication_info['name']))

8836
101597


In [13]:
# Load a local CSV file; it is not used in this cell.
undergraduate_number_cs = pd.read_csv("undergraduate_cs_research.csv")

# Schools to keep, spelled exactly as they appear in the 'dept' column.
# Note: this list has 21 entries.
top20_names = [
    'Carnegie Mellon University',
    'Massachusetts Institute of Technology',
    'University of California - Berkeley',
    'Stanford University',
    'Univ. of Illinois at Urbana-Champaign',
    'Cornell University',
    'University of Washington',
    'Georgia Institute of Technology',
    'Princeton University',
    'University of Texas at Austin',
    'California Institute of Technology',
    'University of Michigan',
    'Columbia University',
    'University of California - Los Angeles',
    'University of Wisconsin - Madison',
    'Harvard University',
    'University of California - San Diego',
    'University of Maryland - College Park',
    'University of Pennsylvania',
    'Purdue University',
    'Yale University',
]

# Keep only the rows that belong to those schools.
top20_US_Universities = professor_publication_info[professor_publication_info['dept'].isin(top20_names)]

# Count the rows for each school.
final_dataframe = top20_US_Universities.groupby('dept').dept.count()
final_dataframe

dept
California Institute of Technology         273
Carnegie Mellon University                3557
Columbia University                       1153
Cornell University                        1704
Georgia Institute of Technology           1763
Harvard University                         644
Massachusetts Institute of Technology     2157
Princeton University                      1136
Purdue University                          883
Stanford University                       2080
Univ. of Illinois at Urbana-Champaign     1883
University of California - Berkeley       2271
University of California - Los Angeles    1013
University of California - San Diego      1443
University of Maryland - College Park     1283
University of Michigan                    1743
University of Pennsylvania                1190
University of Texas at Austin              994
University of Washington                  1638
University of Wisconsin - Madison         1125
Yale University                            487
Name: dept, dtype: int64

## 3. Exploratory Data Analysis
This step of the data science pipeline is for getting a sense of the data and what meaning could be extracted from it. There are a variety of methods for this, but the most common is visualizing the data in charts and graphs. In this step, we will not yet answer the question of how research relates to CS rankings; we will only represent important aspects of the data, which we may or may not use in our analysis.<br><br>
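As a first taste of this kind of visualization, here is a minimal sketch of a horizontal bar chart. The five counts are copied from the per-school output in Section 2; the choice of a bar chart and the names `counts`, `ordered`, `fig`, and `ax` are ours for illustration, not a fixed part of the analysis.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Row counts for a few schools, copied from the Section 2 output.
counts = pd.Series({
    "Carnegie Mellon University": 3557,
    "Massachusetts Institute of Technology": 2157,
    "Stanford University": 2080,
    "University of California - Berkeley": 2271,
    "Yale University": 487,
})

# barh draws the first entry at the bottom, so sorting in ascending order
# puts the school with the most rows at the top of the chart.
ordered = counts.sort_values()

fig, ax = plt.subplots(figsize=(8, 3))
ax.barh(ordered.index, ordered.values)
ax.set_xlabel("Rows in generated-author-info.csv")
ax.set_title("Rows per school (subset of the Section 2 output)")
fig.tight_layout()
```

In a notebook with `%matplotlib inline`, the figure renders when the cell finishes; in a plain script, add `plt.show()` at the end.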