<a href="https://colab.research.google.com/github/maryam98/ReDI-School/blob/main/Maryam_Project_2_free_exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project - Intro to Pandas [Transformations, missing data, editing]

--- 
## The description:
Hey all. To practice your newly acquired pandas skills, we will work with new data!
It is taken from www.kaggle.com, a cool website where you can find plenty of datasets to explore and play with.


--- 
## The datasets:
The first one - in the code called `university` - is the [QS World University Ranking](https://www.kaggle.com/datasets/padhmam/qs-world-university-rankings-2017-2022). 
This dataset contains university data from the years 2017 to 2022. It has a total of 15 features.

- university - name of the university
- year - year of the ranking
- rank_display - rank given to the university
- score - overall score of the university, based on the six key metrics of the QS methodology
- link - link to the university profile page on the QS website
- country - country in which the university is located
- city - city in which the university is located
- region - continent in which the university is located
- logo - link to the logo of the university
- type - type of university (public or private)
- research_output - quality of research at the university
- student_faculty_ratio - number of students per faculty member
- international_students - number of international students enrolled at the university
- size - size of the university in terms of area (a category such as S, M, or L)
- faculty_count - number of faculty or academic staff at the university

The second dataset - in the code called `cost` - is the [Cost of Living Index 2022](https://www.kaggle.com/code/bcruise/cost-of-living-index-2022-eda/notebook) and describes how expensive it is to live in various countries.
It contains numeric columns with various cost of living indices relative to New York City. An index of 123 would mean that this country is 23% more expensive than New York.
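
As a small illustration of how to read such an index, here is a minimal sketch that turns an index value into a percentage difference from New York City. It assumes New York City sits at a baseline of 100, which is what the example above implies; the helper name `percent_vs_new_york` is hypothetical and not part of the dataset or of pandas.

```python
def percent_vs_new_york(index_value: float) -> float:
    """Return how much more (+) or less (-) expensive a place is than NYC, in percent.

    Assumes New York City corresponds to an index of 100.
    """
    return index_value - 100.0


print(percent_vs_new_york(123.0))  # more expensive than New York City
print(percent_vs_new_york(80.0))   # cheaper than New York City
```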

--- 
## The Tasks:

1. Familiarize yourself with the data. How many rows are there, and what range do the numeric columns have? Anything else?

2. According to the average university score in these regions, would you prefer to go to a private or public university in Africa? How would you decide if you want to visit a university in Europe?

3. How many universities have the term "University" in their name? Can you find other common name parts?

4. Are the best universities situated in the most expensive countries?

5. Do cheaper countries attract more foreign students than more expensive countries? Do we get a different picture when we look at the regions separately?


BONUS: 
6. Come up with one question you want to answer about the data and answer it.

In [None]:
import pandas as pd
# this is a hack to allow displaying more than one result per notebook cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
university = pd.read_csv("https://raw.githubusercontent.com/sarahlbe/redi/main/qs-world-university-rankings-2017-to-2022-V2.csv")
cost = pd.read_csv("https://raw.githubusercontent.com/sarahlbe/redi/main/cost_of_living_index.csv")

### 1. Familiarize yourself with the data. How many rows are there, and what range do the numeric columns have? Anything else?

In [None]:
university.info()
university.describe()
cost.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6482 entries, 0 to 6481
Data columns (total 15 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   university              6482 non-null   object 
 1   year                    6482 non-null   int64  
 2   rank_display            6414 non-null   object 
 3   score                   2820 non-null   float64
 4   link                    6482 non-null   object 
 5   country                 6482 non-null   object 
 6   city                    6304 non-null   object 
 7   region                  6482 non-null   object 
 8   logo                    6482 non-null   object 
 9   type                    6470 non-null   object 
 10  research_output         6480 non-null   object 
 11  student_faculty_ratio   6407 non-null   float64
 12  international_students  6318 non-null   object 
 13  size                    6480 non-null   object 
 14  faculty_count           6404 non-null   

| | year | score | student_faculty_ratio |
|---|---|---|---|
| count | 6482.0 | 2820.0 | 6407.0 |
| mean | 2019.693613 | 46.595532 | 13.264554 |
| std | 1.716683 | 18.81311 | 6.604294 |
| min | 2017.0 | 23.5 | 1.0 |
| 25% | 2018.0 | 31.8 | 9.0 |
| 50% | 2020.0 | 40.6 | 12.0 |
| 75% | 2021.0 | 58.025 | 17.0 |
| max | 2022.0 | 100.0 | 67.0 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 139 entries, 0 to 138
Data columns (total 8 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   rank                            139 non-null    float64
 1   country                         139 non-null    object 
 2   cost of living index            139 non-null    float64
 3   rent index                      139 non-null    float64
 4   cost of living plus rent index  139 non-null    float64
 5   groceries index                 139 non-null    float64
 6   restaurant price index          139 non-null    float64
 7   local purchasing power index    139 non-null    float64
dtypes: float64(7), object(1)
memory usage: 8.8+ KB
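
The `info()` output above shows that several count-like columns (for example `international_students`) are stored as `object`, so `describe()` skips them, and that `score` is missing for more than half of the rows. A minimal sketch of how the row count, the missing values, and the numeric ranges could be inspected is shown below. It runs on a made-up miniature frame called `sample` (a hypothetical stand-in, not the real data), so the numbers it prints say nothing about the actual dataset.

```python
import pandas as pd

# Hypothetical miniature stand-in for `university`; all values are invented.
sample = pd.DataFrame({
    "university": ["A", "B", "C", "D"],
    "score": [90.5, None, 40.0, None],
    "international_students": ["3,730", "692", None, "1,200"],
})

# Number of rows and columns
n_rows, n_cols = sample.shape

# Missing values per column
missing_per_column = sample.isna().sum()

# Text columns with thousands separators need converting before they count as numeric
sample["international_students"] = pd.to_numeric(
    sample["international_students"].str.replace(",", "", regex=False),
    errors="coerce",
)

# Smallest and largest value of every numeric column
numeric_range = sample.select_dtypes("number").agg(["min", "max"])

print(n_rows, n_cols)
print(missing_per_column)
print(numeric_range)
```

The same three steps should carry over to `university` and `cost`, assuming the other `object` columns that hold numbers are converted in the same way first.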


### 2. According to the average university score, would you prefer to go to a private or public university in Africa? How would you decide if you want to visit a university in Europe?

In [None]:
university.head(100)

| | university | year | rank_display | score | link | country | city | region | logo | type | research_output | student_faculty_ratio | international_students | size | faculty_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Massachusetts Institute of Technology (MIT) | 2017 | 1 | 100.0 | https://www.topuniversities.com/universities/m... | United States | Cambridge | North America | https://www.topuniversities.com/sites/default/... | Private | Very High | 4.0 | 3730 | M | 3065 |
| 1 | Stanford University | 2017 | 2 | 98.7 | https://www.topuniversities.com/universities/s... | United States | Stanford | North America | https://www.topuniversities.com/sites/default/... | Private | Very High | 3.0 | 3879 | L | 4725 |
| 2 | Harvard University | 2017 | 3 | 98.3 | https://www.topuniversities.com/universities/h... | United States | Cambridge | North America | https://www.topuniversities.com/sites/default/... | Private | Very High | 5.0 | 5877 | L | 4646 |
| 3 | University of Cambridge | 2017 | 4 | 97.2 | https://www.topuniversities.com/universities/u... | United Kingdom | Cambridge | Europe | https://www.topuniversities.com/sites/default/... | Public | Very high | 4.0 | 7925 | L | 5800 |
| 4 | California Institute of Technology (Caltech) | 2017 | 5 | 96.9 | https://www.topuniversities.com/universities/c... | United States | Pasadena | North America | https://www.topuniversities.com/sites/default/... | Private | Very High | 2.0 | 692 | S | 968 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | University of Geneva | 2017 | 95 | 63.6 | https://www.topuniversities.com/universities/u... | Switzerland | Geneva | Europe | https://www.topuniversities.com/sites/default/... | Public | Very High | 9.0 | 6547 | L | 1814 |
| 96 | KTH Royal Institute of Technology | 2017 | 97 | 63.1 | https://www.topuniversities.com/universities/k... | Sweden | Stockholm | Europe | https://www.topuniversities.com/sites/default/... | Public | Very High | 8.0 | 3057 | L | 3600 |
| 97 | Uppsala University | 2017 | 98 | 62.8 | https://www.topuniversities.com/universities/u... | Sweden | Uppsala | Europe | https://www.topuniversities.com/sites/default/... | Public | Very High | 10.0 | 8401 | L | 2843 |
| 98 | Korea University | 2017 | 98 | 62.8 | https://www.topuniversities.com/universities/k... | South Korea | Seoul | Asia | https://www.topuniversities.com/sites/default/... | Private | Very High | 6.0 | 3325 | L | 4026 |


In [None]:
# Count African universities whose score is at or above the global mean score, grouped by type
university.loc[(university.score >= university.score.mean()) & (university.region == 'Africa'), ['type']].groupby('type').size()

type
Public    2
dtype: int64
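
The cell above counts African universities that score at or above the global mean. The task itself asks about the average score per university type, which could be compared more directly with a grouped mean. The following is a minimal sketch on a made-up frame called `toy` (a hypothetical stand-in with invented scores), so its numbers are not results for the real data.

```python
import pandas as pd

# Hypothetical toy rows standing in for `university`; the scores are invented.
toy = pd.DataFrame({
    "region": ["Africa", "Africa", "Africa", "Europe", "Europe", "Europe"],
    "type": ["Public", "Public", "Private", "Public", "Private", "Private"],
    "score": [30.0, 40.0, None, 60.0, 50.0, 70.0],
})

# mean() skips missing scores; count() shows how many scores back each mean
score_by_type = (
    toy[toy["region"].isin(["Africa", "Europe"])]
    .groupby(["region", "type"])["score"]
    .agg(["mean", "count"])
)
print(score_by_type)
```

Reporting the count next to the mean matters here because `score` is missing for many rows, so an average may rest on very few universities, or on none at all.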

### 3. How many universities have the term "University" in their name? Can you find other common name parts?

In [None]:
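from collections import Counter
# Split every name into words and flatten to one word per row
# (explode avoids the quadratic cost of summing Python lists)
words = university.university.str.split().explode().dropna()
# Count how often each word occurs across all rows, most frequent first
c = Counter(words)
c.most_common()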


[('University', 4621),
 ('of', 2366),
 ('de', 671),
 ('Universidad', 536),
 ('Technology', 477),
 ('National', 285),
 ('Institute', 221),
 ('State', 221),
 ('The', 214),
 ('Université', 193),
 ('and', 192),
 ('Universidade', 157),
 ('Science', 154),
 ('-', 125),
 ('Universität', 116),
 ('Federal', 114),
 ('Technical', 112),
 ('Nacional', 94),
 ('London', 90),
 ('Universiti', 80),
 ('del', 78),
 ('Católica', 71),
 ('College', 61),
 ('San', 59),
 ('New', 55),
 ('do', 55),
 ('School', 54),
 ('California,', 53),
 ('Indian', 53),
 ('South', 52),
 ('in', 52),
 ('Polytechnic', 52),
 ('at', 51),
 ('Malaysia', 51),
 ('Autónoma', 49),
 ('the', 47),
 ('di', 44),
 ('King', 44),
 ('&', 43),
 ('Universitat', 43),
 ('Hong', 42),
 ('Kong', 42),
 ('Central', 42),
 ('China', 41),
 ('Pontificia', 41),
 ('Paris', 40),
 ('Università', 40),
 ('Economics', 39),
 ('City', 39),
 ('Research', 38),
 ('American', 38),
 ('Southern', 37),
 ('Beijing', 37),
 ('Business', 37),
 ('University,', 36),
 ('Tokyo', 36),
 (
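
The word counts above are taken over all rows, so a university that is listed in several ranking years is counted several times. To count universities rather than rows, one option is to de-duplicate the names first and then test each name for the term. This is a minimal sketch on a hypothetical series called `names`; the entries are only examples.

```python
import pandas as pd

# Hypothetical names; in the real column a university can repeat across ranking years.
names = pd.Series([
    "Stanford University", "Stanford University",
    "University of Cambridge", "ETH Zurich",
    "Universidad de Chile",
])

# Keep each name once
unique_names = names.drop_duplicates()

# regex=False treats the term as a literal substring; the match is case-sensitive
has_term = unique_names.str.contains("University", regex=False)

print(int(has_term.sum()), "of", len(unique_names), "names contain 'University'")
```

On the real data the equivalent would presumably be `university.university.drop_duplicates()`, assuming a university keeps the same spelling in every year.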

### 4. Are the best universities situated in the most expensive countries?

In [None]:
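# Countries that host at least one top-10 university (non-numeric rank strings become NaN)
top_unies = university.loc[pd.to_numeric(university.rank_display, errors='coerce') <= 10, 'country'].unique()
# The ten countries with the highest combined cost of living and rent index
top_expensive_countries = cost.sort_values(by=['cost of living plus rent index'], ascending=False)['country'].head(10).unique()
display(top_unies, top_expensive_countries)
print('********top uni and top expensive*********')
# Countries that appear in both lists
set(top_unies) & set(top_expensive_countries)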


array(['United States', 'United Kingdom', 'Switzerland'], dtype=object)

array(['Bermuda', 'Switzerland', 'Jersey', 'Hong Kong', 'Singapore',
       'Luxembourg', 'Iceland', 'Norway', 'Guernsey', 'Israel'],
      dtype=object)

********top uni and top expensive*********


{'Switzerland'}
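
Comparing two top-10 lists only looks at the extremes. A complementary view is to join the two tables on `country` and check whether a higher best score tends to go along with a higher cost index across all matched countries. The sketch below does this on two invented frames, `uni` and `living` (hypothetical stand-ins for `university` and `cost`), and uses a rank correlation computed as the Pearson correlation of the ranks.

```python
import pandas as pd

# Hypothetical values for illustration only; not taken from the real datasets.
uni = pd.DataFrame({
    "country": ["A", "A", "B", "C", "D"],
    "score": [90.0, 70.0, 60.0, 40.0, 55.0],
})
living = pd.DataFrame({
    "country": ["A", "B", "C"],
    "cost of living plus rent index": [100.0, 80.0, 40.0],
})

# Best score achieved by any university in each country
best_score = uni.groupby("country", as_index=False)["score"].max()

# An inner merge keeps only countries present in both tables (country "D" drops out)
merged = best_score.merge(living, on="country", how="inner")

# Spearman rank correlation, computed as the Pearson correlation of the ranks
rho = merged["score"].rank().corr(merged["cost of living plus rent index"].rank())

print(merged)
print(rho)
```

Country names have to be spelled identically in both tables for the merge to match them, so it is worth checking which countries drop out before interpreting the correlation.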

### 5. Do cheaper countries attract more foreign students than more expensive countries? Do we get a different picture when we look at the regions separately?

In [None]:
top_cheaper_countries=cost.sort_values(by='cost of living plus rent index')['country'].head(10).unique()
top_cheaper_countries

array(['Afghanistan', 'Pakistan', 'India', 'Algeria', 'Nepal', 'Tunisia',
       'Syria', 'Colombia', 'Kosovo (Disputed Territory)', 'Turkey'],
      dtype=object)

In [None]:
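# Strip thousands separators so the counts can be sorted numerically; missing values stay NaN
university['international_students'] = university['international_students'].map(lambda x: float(str(x).replace(',', '')))
# Countries of the ten rows with the most international students (a country can appear more than once)
more_foreign_students = university.sort_values(by='international_students', ascending=False)['country'].head(10)
more_foreign_students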


5240    Australia
1968    Australia
4051    Australia
2985    Australia
992     Australia
64      Australia
1982    Argentina
4064    Argentina
5250    Argentina
84      Argentina
Name: country, dtype: object

In [None]:
set(top_cheaper_countries) & set(more_foreign_students)

set()
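
The empty intersection shows that none of the ten cheapest countries appears among the ten rows with the most international students. To use all countries and to look at the regions separately, one possible approach is to average the international student numbers per country, merge them with the cost index, and compute the correlation overall and per region. This is a minimal sketch on invented frames `uni` and `living` (hypothetical stand-ins), so the values it produces are not findings about the real data.

```python
import pandas as pd

# Hypothetical values for illustration only.
uni = pd.DataFrame({
    "country": ["A", "A", "B", "C", "D"],
    "region": ["Europe", "Europe", "Europe", "Asia", "Asia"],
    "international_students": ["1,000", "3,000", "500", "4,000", "200"],
})
living = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "cost of living index": [90.0, 50.0, 30.0, 70.0],
})

# Convert strings such as "1,000" to numbers; unparseable entries become NaN
uni["international_students"] = pd.to_numeric(
    uni["international_students"].str.replace(",", "", regex=False),
    errors="coerce",
)

# Average number of international students per country, joined with the cost index
per_country = (
    uni.groupby(["region", "country"], as_index=False)["international_students"]
    .mean()
    .merge(living, on="country", how="inner")
)

# Pearson correlation over all countries, then separately for each region
overall = per_country["international_students"].corr(per_country["cost of living index"])
by_region = {
    region: grp["international_students"].corr(grp["cost of living index"])
    for region, grp in per_country.groupby("region")
}

print(overall)
print(by_region)
```

In this toy data the overall correlation is negative while the two regions point in opposite directions, which is the kind of difference the second part of the question asks about. With only two countries per region the correlation is always exactly +1 or -1, so the real data needs many more countries per region before the numbers mean anything.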

## Review

This section is only for the reviewing team!

Guidelines on how to review: https://docs.google.com/presentation/d/1YORFwlfVQo9ogj7jR9t6_pqxmGIlpBSGubbp1UdtcBQ/edit?usp=sharing

## Review Criteria:
<h3><input type="checkbox"> 1. Are all questions answered? <br></h3> 
<h3><input type="checkbox"> 2. Does all code run through? <br></h3> 
<h3><input type="checkbox"> 3. Are the conclusions understandable?  <br></h3> 
<h3><input type="checkbox"> 4. Is the bonus question answered?  <br></h3> 