# Alex's data quest

In [132]:
import pandas as pd

In [133]:
#data = pd.read_csv('data.csv', error_bad_lines=False, encoding="utf-8") # CSV old way
data = pd.read_json("data_5scheduler.json") # json new way

Look at the data:

In [134]:
print(data.describe(include="object"))

                title   identifier description  source instructors offered  \
count            4443         4443        4443    4443        4443    4443   
unique           3761         4234        3989       5        1231     139   
top     Senior Thesis  PHYS-178-KS              Pomona          []           
freq               56            2          95    1446        2023    1559   

       prerequisites corequisites  
count           4443         4443  
unique           716           28  
top                                
freq            3359         4411  


### Duplicates:
It looks like there are only 3,989 unique course descriptions, so let's remove duplicates based on the `description` column.
There are also rows with empty descriptions, which are not helpful, so we drop those too.

In [135]:
print(len(data))
data = data.drop_duplicates(subset='description')
data = data[data["description"] != ""]
print(len(data))

4443
3988
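
The two counts fit together: the 3,989 unique descriptions include the empty string as one value, so `drop_duplicates` leaves a single empty row and the filter then removes it, giving 3,988. A minimal sketch of that behavior on a made-up table (the titles and descriptions here are hypothetical, not taken from the course data):

```python
import pandas as pd

# Toy stand-in for the course table; all values are invented for illustration.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "description": ["Intro to logic.", "", "Intro to logic.", "", "Advanced logic."],
})

# drop_duplicates keeps the first row of each distinct description,
# so the two empty descriptions collapse into a single row.
deduped = toy.drop_duplicates(subset="description")

# The one remaining empty description is removed by the filter.
cleaned = deduped[deduped["description"] != ""]

print(len(toy), len(deduped), len(cleaned))
```

Note that `drop_duplicates` keeps the first occurrence by default, so which school's copy of a cross-listed course survives depends on the row order.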


In [136]:
data.head(10)

Unnamed: 0,title,identifier,description,source,credits,instructors,offered,prerequisites,corequisites,currently_offered,fee
0,Introduction to American Cultures,AMST-103-HM,An interdisciplinary introduction to principal...,HarveyMudd,300,[Staff],,,,False,0
1,Print and American Culture,AMST-115-HM,Covers numerous developments in American print...,HarveyMudd,300,[Anup Gampa],,,,True,0
2,Hyphenated Americans,AMST-120-HM,A focus on the experience of immigrants in the...,HarveyMudd,300,[Balseiro],,,,False,0
3,"Life: Knowledge, Belief, and Cultural Practices",ANTH-110-HM,An exploration of cultural attitudes toward li...,HarveyMudd,300,[de Laet],,,,False,0
4,Introduction to the Anthropology of Science an...,ANTH-111-HM,An introduction to science and technology as c...,HarveyMudd,300,[Marianne De Laet],,,,True,0
5,War and Conflict,ANTH-115-HM,“The wings of the butterfly—that cause the hur...,HarveyMudd,300,[de Laet],,,,False,0
6,Rationalities,ANTH-134-HM,What does it mean to be rational? Does it mean...,HarveyMudd,300,[de Laet],Offered alternate years,Any introductory course in anthropology or any...,,False,0
7,A History of Landscape Photography,ARHI-131-HM,This course explores how photographic landscap...,HarveyMudd,300,[Fandell],,,,False,0
8,Modern and Contemporary Art Practices,ART-002-HM,This class is an experimental lecture style ar...,HarveyMudd,300,[Fandell],,,,False,0
9,Photography,ART-033-HM,Approaching the medium from an artistic perspe...,HarveyMudd,300,[Fandell],,ART002 HM,,False,150


### tf-idf with scikit-learn
[Description](https://monkeylearn.com/blog/what-is-tf-idf/)

[Usage](https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.Y1M42ezMJhF)

Here is an example of how TfIdf would work if our documents were the following 4 sentences:

In [137]:
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

corpus = [
     'this is the first document',
     'this document is the second document',
     'and this is the third one',
     'is this the first document',
]
vectorizer = TfidfVectorizer(use_idf=True)
vectors = vectorizer.fit_transform(corpus)
firstv = vectors[0]
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df = pd.DataFrame(firstv.T.toarray(), index=vectorizer.get_feature_names_out(), columns=["tfidf"])
df = df.sort_values(by=["tfidf"], ascending = False)
print("TfIdf values for the first sentence")
print(df)


TfIdf values for the first sentence
             tfidf
first     0.580286
document  0.469791
is        0.384085
the       0.384085
this      0.384085
and       0.000000
one       0.000000
second    0.000000
third     0.000000




In the example above we can see the importance of each word ranked for the first sentence `'this is the first document'`. The word `first` scores highest because it appears in only two of the four sentences (the first and the fourth), which makes it relatively distinctive. The word `the` scores lower since it appears in every sentence. And the word `third` scores 0 since it doesn't appear in the first sentence at all.
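
The scores in the table can be reproduced by hand, which makes the ranking easier to trust. The sketch below assumes `TfidfVectorizer`'s default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. It uses a plain whitespace split as a stand-in for the vectorizer's tokenizer, which is equivalent for these four sentences only.

```python
import math

corpus = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]
docs = [sentence.split() for sentence in corpus]
n_docs = len(docs)

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

first_doc = docs[0]

# Raw tf-idf weight: term count in the sentence times the term's idf
raw = {term: first_doc.count(term) * idf(term) for term in sorted(set(first_doc))}

# L2-normalize so the sentence's vector has unit length
norm = math.sqrt(sum(value * value for value in raw.values()))
scores = {term: value / norm for term, value in raw.items()}

for term, score in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{term:<10}{score:.6f}")
```

Here `first` has a document frequency of 2, `document` of 3, and `is`, `the`, and `this` of 4, which is why they fall into three score levels.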

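The function `tfidf(word, data)` takes the word we are interested in and the dataframe of courses. It returns the dataframe with a new column `"score"` that holds the TF-IDF weight of the input word in each class description, sorted from highest to lowest score. If the word does not occur in any description, it prints a message and returns `None`. Since `TfidfVectorizer` lowercases text by default, the word should be given in lowercase.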

In [138]:
def tfidf(word, data):
    print(data.loc[0, "description"])  # sanity check: show the first description
    corpus = list(data.description)
    vectorizer = TfidfVectorizer(use_idf=True)
    vectors = vectorizer.fit_transform(corpus)

    # vocabulary_ maps each term to its column index in the TF-IDF matrix
    index = vectorizer.vocabulary_.get(word)
    if index is None:
        print("'" + word + "'" + " is not mentioned in any course descriptions")
        return

    # Take the whole column for this word at once: one score per description
    score_for_word = vectors[:, index].toarray().ravel()

    data = data.copy()  # work on a copy so the caller's dataframe is not modified
    data["score"] = score_for_word
    data = data.sort_values(by=["score"], ascending=False)
    return data

For example, let's say we are interested in ranking all of the classes based on the word `computer`:

In [139]:
tfidf('computer', data).head(20)

An interdisciplinary introduction to principal themes in American culture taught by an intercollegiate faculty team.




Unnamed: 0,title,identifier,description,source,credits,instructors,offered,prerequisites,corequisites,currently_offered,fee,score
132,Computer Science Seminar,CSCI-181-HM,Advanced topics of current interest in compute...,HarveyMudd,0,[Staff],Fall and Spring,Permission of instructor,,False,0,0.442194
525,Special Topics in Computer Science,CSCI-181-CM,Selected topics in computer science. May be re...,ClaremontMckenna,100,[],Occasionally,,,False,0,0.431867
1490,Computer Science Colloquium,CSCI-188-PO,Colloquium presentations and discussions of to...,Pomona,0,[Joseph C Osborn],Each semester.,"CSCI 051A PO , or CSCI 051G PO , or CSCI 051J ...",,True,0,0.422261
426,Introduction to Computational Neuroscience,BIOL-133L-KS,This course provides computational skills for ...,ClaremontMckenna,100,[],Every fall,,,False,0,0.342191
1491,Computer Science Senior Seminar,CSCI-190-PO,"Reading, discussion and presentation of resear...",Pomona,25,[Joseph C Osborn],Each semester.,Senior standing and two CSCI core courses (inc...,,True,0,0.33416
1060,Computational Physics and Engineering,PHYS-100-KS,This course is a comprehensive introduction to...,ClaremontMckenna,100,[Scot Gould],Every spring,,,True,0,0.327428
518,Fundamentals of Computer Science,CSCI-052-CM,"A solid foundation in functional programming, ...",ClaremontMckenna,100,[],Occasionally,,,False,0,0.314859
3358,Computational Physics and Engineering,PHYS-100-KS,This course is a comprehensive introduction to...,Scripps,100,[],Every spring,"PHYS033L KS , PHYS034L KS ; or PHYS030L KS ,...",,False,0,0.311172
137,Computer Science Colloquium,CSCI-195-HM,Oral presentations and discussions of selected...,HarveyMudd,50,[Melissa E. O'Neill],Fall and Spring,Juniors and seniors only,,True,0,0.276373
106,Introduction to Biology and Computer Science,CSCI-005GR-HM,This course introduces fundamental concepts fr...,HarveyMudd,300,"[Wu, Bush (Biology)]",Fall,,,False,0,0.257451


These are the classes most related to the word `computer`, ranked in descending order of score (the cell asks for the top 20; the first 10 are shown above). So, we could recommend these classes to a student who is interested in `computer`.