Here we import the basic libraries required for data analysis with Python!

In [1]:
#Here we import the basic Python data analytics libraries
import warnings
warnings.filterwarnings("ignore")

import pandas as pd   # includes tools used in reading data
import numpy as np   # includes tools for numerical calculus
import matplotlib.pyplot as plt  # includes tools used in plotting data

Below we define a function that we can use to track how much time and memory the code uses between a certain part of running the code.

In [2]:
import psutil
import time

def track_memory_and_time(start_time=None):
    """
    Track memory usage and elapsed time since a specific point in the code.

    Parameters:
    - start_time: Optional parameter. If provided, it should be the result of a previous call to time.time().

    Returns:
    - memory_usage: Current memory usage in bytes.
    - elapsed_time: Elapsed time in seconds since the specified start time.
    """

    # Get memory usage
    memory_usage = psutil.virtual_memory().used

    # Get elapsed time
    current_time = time.time()
    elapsed_time = current_time - start_time if start_time else 0

    return memory_usage, elapsed_time

In [3]:
#Defining and creating a function with which you can later on get the accuracy of the ML/AI algorithm
def accuracy(clf, X_test, y_test):
    score = str(clf.score(X_test, y_test)*100)+'%'

    return score

In [4]:
#this is just defining and creating a function with which you can later on compare the AI predictions to the original data
def find_differences(clf, X_test, y_test):
    
    predictions = []
    test_labels = []
    column_names = list(X_test.columns.values)
    events = pd.DataFrame(columns=list(X_test.columns.values))
    
    for i in range(len(list(y_test))-1):
        if str(clf.predict([X_test.values[i+1]])[0]) != list(y_test)[i+1]:
            predictions.append(clf.predict([X_test.values[i+1]])[0])
            test_labels.append(list(y_test)[i+1])
            events.loc[len(events)] = X_test.values[i+1]
               
    df = pd.DataFrame(columns=['Prediction', 'Original label'])
    df['Prediction']= predictions
    df['Original label']= test_labels

    df = pd.concat([df, events], axis=1)
    return df

In [5]:
#Move this block to the location in the code from where you want to start tracking
start_time = time.time()

Below are some different types of machine learning algorithms. Acquaint yourself with the basic pros and cons of each. If you wish to, you can also find other types of ML to learn about at scikit-learn.org.

__[Neural Network](https://scikit-learn.org/stable/modules/neural_networks_supervised.html)__

__[Support Vector Machine](https://scikit-learn.org/stable/modules/svm.html#svm)__

__[Nearest Neighbours](https://scikit-learn.org/stable/modules/neighbors.html)__

__[Decision Tree](https://scikit-learn.org/stable/modules/tree.html#tree)__

In [6]:
#Here we import the different ML algorithms from the scikit-learn library

from sklearn.neural_network import MLPClassifier
#Neural Network

from sklearn import svm
#Support Vector Machine

from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier 
#Nearest Neighbours

from sklearn import tree
#Decision tree

Below we define a function with which we can choose the machine learning algorithm to choose by calling its name later on in the notebook. Some specifics are given to the algorithms such as iterations to use, how to scale data. If you want to get acquainted with the different such options, use the links above to learn about the parameters and their effects.

However, <b>consider: Who would know which parameters to use and why if you were to implement AI in your organization?</b>

In [7]:
def choose_algorithm(name):
    if name == "NN":
        clf = MLPClassifier(random_state=1, max_iter=300).fit(X_train, y_train)
    elif name == "KNN":
        from sklearn.pipeline import Pipeline
        from sklearn.preprocessing import StandardScaler
        clf = Pipeline(steps=[("scaler", StandardScaler()), ("knn", KNeighborsClassifier(n_neighbors=11))])
    elif name == "SVM":
        clf = svm.SVC()
    else:
        clf = tree.DecisionTreeClassifier()
        
    return clf

It's time to think of the data. Let's begin with quantitative and get to qualitative later on...

## Quantitative

For supervised machine learning, you need a dataset and correct labels to give to it.

<b>For your work, think of an AI task that would be useful for you and determine the type of data you would need to teach to an AI: Where or how can you get 10k+ examples with the correct "labels" assigned to the datapoints - what resources would you need to create/retrieve this data?</b>

I will be making good use of CERN's open data repositories to demonstrate the ML process, and you can as well, if you so wish for this session:

Find two datasets available at __[CERN's open data repository](http://opendata.cern.ch/search?page=1&size=20&keywords=education)__ that have candidates or events for two different particle phenomena. 

In [8]:
#Find two datasets and load them with the link:
#For example, locating a dataset of Z to two muons and a dataset of J/psi to two electrons.

Z2mu = pd.read_csv('http://opendata.cern.ch/record/307/files/Zmumu.csv')
Psi2e = pd.read_csv('http://opendata.cern.ch/record/302/files/dielectron-Jpsi.csv')

Let us then study some characteristics of the found datasets.

In [9]:
#Printing out the number of datapoints in the datasets and the dimension of the features measured in them.
print(Z2mu.shape)
print(Psi2e.shape)
#Printing out an example of the data in the beginning of it: What are the included characteristics and what do they look like?
Z2mu.head()

(2304, 20)
(2000, 19)


Unnamed: 0,Type,Run,Event,E1,px1,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M
0,GT,148031,10507008,82.201866,-41.195288,17.433244,-68.964962,44.7322,-1.21769,2.74126,1,60.621875,34.144437,-16.119525,-47.426984,38.8311,-1.05139,-0.440873,-1,82.462692
1,TT,148031,10507008,62.344929,35.11805,-16.570362,-48.775247,38.8311,-1.05139,-0.440873,-1,82.201866,-41.195288,17.433244,-68.964962,44.7322,-1.21769,2.74126,1,83.626204
2,GT,148031,10507008,62.344929,35.11805,-16.570362,-48.775247,38.8311,-1.05139,-0.440873,-1,81.582778,-40.883323,17.299297,-68.447255,44.7322,-1.21769,2.74126,1,83.308465
3,GG,148031,10507008,60.621875,34.144437,-16.119525,-47.426984,38.8311,-1.05139,-0.440873,-1,81.582778,-40.883323,17.299297,-68.447255,44.7322,-1.21769,2.74126,1,82.149373
4,GT,148031,105238546,41.826389,22.783582,15.036444,-31.689894,27.2981,-0.990688,0.583351,1,49.760726,-20.177373,-9.354149,44.513955,21.8913,1.44434,-2.70765,-1,90.469123


In [10]:
#Printing out an example of the data in the beginning of it: What are the included characteristics and what do they look like?
Psi2e.head()

Unnamed: 0,Run,Event,E1,px1,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M
0,147390,543767492,24.521,3.89773,-16.1627,-18.0237,16.626,-0.939585,-1.33416,-1,9.36334,4.20606,-5.50359,-6.30014,6.92679,-0.815936,-0.918244,-1,4.62593
1,147390,551904480,42.8325,-16.4724,4.63309,-39.266,17.1116,-1.56816,2.86741,1,10.488,-3.39472,-0.164604,-9.92201,3.39871,-1.79263,-3.09314,-1,2.9906
2,147390,286521299,78.6993,20.7346,-22.7603,72.4267,30.7889,1.59096,-0.831937,-1,19.1186,5.52052,-3.86481,17.8916,6.73891,1.70329,-0.61078,-1,3.56757
3,147390,348830108,35.7096,-12.6783,10.2126,-31.7827,16.2799,-1.42208,2.4635,1,15.7418,-4.55461,2.89711,-14.7873,5.39793,-1.73266,2.57506,-1,3.10446
4,147390,348839604,12.8308,-9.97245,-5.51779,-5.89352,11.3972,-0.496456,-2.63622,-1,20.4744,-18.0386,-2.92818,-9.23231,18.2747,-0.485855,-2.98067,1,4.94889


<b>Check if the datasets are compatible: Do they have the same number of variables, etc. If you need to do some manipulating, do that.</b>

In [11]:
Z2mu = Z2mu.drop(['Type'], axis = 1)
#Z2mu.drop(['Run', 'Event'], axis = 1) ?
Z2mu = Z2mu.assign(Phenomenon = ['Z to two muons'] * len(Z2mu))
#Psi2e.drop(['Run', 'Event'], axis = 1) ?
Psi2e = Psi2e.assign(Phenomenon = ['J/psi to two electrons'] * len(Psi2e))
df = pd.concat([Z2mu, Psi2e])
df = df.dropna(axis=1)
df = df.sample(frac=1) #Randomization
df.head()

Unnamed: 0,Run,Event,E1,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M,Phenomenon
451,146514,512900945,66.6536,-18.4372,-63.7711,19.3895,-1.90606,-1.8855,1,19.8335,0.503777,-6.54009,-18.7174,6.55946,-1.77107,-1.49392,1,4.64492,J/psi to two electrons
1019,147222,65283400,40.5566,11.0375,-36.7004,17.2603,-1.49873,0.69381,1,12.9571,2.29937,5.55609,-11.4773,6.01309,-1.40206,1.17841,1,4.98707,J/psi to two electrons
1114,147284,68429848,11.0119,1.9211,-10.2157,4.11114,-1.6416,2.65537,1,50.3987,-13.1316,4.33679,-48.4642,13.8292,-1.96695,2.82261,1,2.76723,J/psi to two electrons
266,148031,197673445,131.451509,-23.254379,126.455559,35.8954,1.97199,-0.704742,-1,28.163231,-24.930048,12.343262,4.392381,27.9172,0.157219,2.68182,1,90.723758,Z to two muons
641,146804,199305851,49.2803,-9.82289,-47.1618,14.2941,-1.9091,-2.38397,1,9.52495,-1.26145,-3.19019,-8.88573,3.43054,-1.68021,-1.94735,-1,3.4323,J/psi to two electrons


<b>What are the cleaning operations performed above on the CERN datasets? Try to search for what they do and think about why they should/should not be performed.</b>

In [12]:
#what variables do you wish to use for prediction from the dataset you have built? What should you leave out if anything?
X = df[['Run', 'Event', 'E1', 'py1', 'pz1', 'pt1', 'eta1', 'Q1', 'E2', 'px2', 'py2','pz2','pt2','eta2','phi2','Q2','M']]
y =  df['Phenomenon']

Then we split the found data to training and testing data. 
Sometimes we also use validation data. 

<b>What are common splits in percentages and what would be applicable here?</b>

In [13]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

In [14]:
#Let's see what the testing data looks like
X_test.head()

Unnamed: 0,Run,Event,E1,py1,pz1,pt1,eta1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M
288,148862,278711112,19.1354,-14.8067,-11.4867,15.3042,-0.693594,1,11.0219,1.12826,-6.45562,-8.86193,6.55347,-1.10991,-1.39777,-1,4.28027
724,146804,621991115,26.6221,12.4828,22.8801,13.6102,1.2912,1,5.9777,1.91029,3.59436,4.3777,4.07046,0.933752,1.08229,-1,2.73793
80,147390,101369049,58.6116,-7.26073,56.6988,14.8516,2.04952,1,9.52965,-3.09616,0.162118,9.01119,3.1004,1.78844,3.08928,-1,4.16847
1105,148031,524869823,58.655355,22.773211,-12.407713,57.3279,-0.214779,1,46.390126,26.326951,17.079124,-34.164716,31.3222,-0.942571,0.575468,-1,81.157416
636,148031,386108314,39.002286,-37.885555,-0.561573,38.9981,-0.0144,1,70.676846,-18.590527,22.322908,-64.43047,29.067,-1.53704,2.26522,-1,86.464321


Now determine which ML algorithm you want to try from the ones you studied briefly earlier.

In [15]:
    #'NN' - Neural Network
    #'KNN' - Nearest Neighbours
    #'SVM' - Support Vector Machine
    #'Tree' - Decision Tree
name = "KNN"    
clf = choose_algorithm(name).fit(X_train, y_train)

In [16]:
'''
Move this block to where you want to stop the
tracking of used time and memory of the code.
'''

#start_time = time.time()
# Call the function to get memory usage and elapsed time
memory_used, time_elapsed = track_memory_and_time(start_time)

print(f"Memory Used: {memory_used} bytes")
print(f"Time Elapsed: {time_elapsed} seconds")

Memory Used: 10277777408 bytes
Time Elapsed: 1.4398720264434814 seconds


The block below prints out the accuracy of the ML algorithm on the test set.
If used algorithm is the decision tree, it also prints out the visualization of the tree.

<b>How could you check the accuracy? Think of a strategy depending on the used algorithm.</b>

Consider cases with correct labels in the training data and cases with unclear cases in the training data.

In [17]:
print("The accuracy of the algorithm is: ")
print(accuracy(clf, X_test, y_test))
if name == "Tree":
    tree.plot_tree(clf)

The accuracy of the algorithm is: 
97.4907063197026%


In [18]:
print("Instances in which the algorithm prediction diverged from the original label were: ")
find_differences(clf, X_test, y_test).head()

Instances in which the algorithm prediction diverged from the original label were: 


Unnamed: 0,Prediction,Original label,Run,Event,E1,py1,pz1,pt1,eta1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M
0,J/psi to two electrons,Z to two muons,148031.0,607496200.0,8.301393,1.625099,8.128183,1.68363,2.27809,-1.0,4.385337,0.92004,0.329448,4.273758,0.977246,2.18148,0.343855,1.0,1.75908
1,J/psi to two electrons,Z to two muons,148031.0,223702008.0,4.883931,-1.107473,4.255405,2.3944,1.33933,1.0,56.768048,23.360886,-18.346246,-48.376486,29.7038,-1.26406,-0.665738,-1.0,32.012451
2,J/psi to two electrons,Z to two muons,148031.0,124112566.0,6.183699,1.220354,-5.957006,1.65561,-1.99232,-1.0,29.909979,23.700799,9.743632,-15.424962,25.6255,-0.570486,0.390047,1.0,14.676876
3,J/psi to two electrons,Z to two muons,148031.0,124112566.0,6.914421,2.343046,-6.380242,2.66281,-1.60792,-1.0,6.03289,-0.647688,0.852445,-5.936197,1.07059,-2.41404,2.22054,-1.0,1.438029
4,J/psi to two electrons,Z to two muons,148029.0,392623569.0,46.574519,19.53441,13.530449,44.5657,0.299126,-1.0,6.636212,-1.530955,2.065084,-6.117169,2.57068,-1.60156,2.20874,-1.0,28.734711


Consider the following questions:

<b>Why did the ML algorithm label these events as it did?</b>

<b>Are there misclassification due to error in the training dataset, a wrong label slipped to the wrong place?</b>

<b>Who could you ask to know? How would you have access to them? How much time or money would that cost for you to find out?</b>

<b>Are the answers different to the above questions with different algorithms?</b>

## Qualitative

For supervised machine learning, you need a dataset and correct labels to give to it.

<b>For your work, think of an AI task using qualitative data, e.g., text of images, that would be useful for you and determine the type of data you would need to teach to an AI: Where or how can you get 10k+ examples with the correct "labels" assigned to the datapoints - what resources would you need to create/retrieve this data?</b>

The example here uses text pieces.

In [19]:
'''
Importing the Python libraries to analyze and treat text data.
'''
import re
import spacy
spacy_model = "en_core_web_sm"

Opening up the file with the data, this is the example, if you have different data in mind, you can use that...

In [20]:
df = pd.read_excel('cern_news_data.xlsx')
#Choosing the columns we want for our analysis
df = df[['Document', 'Label']]
#Dropping rows with empty data
df = df.dropna(how = 'any',axis = 0).reset_index(drop = True)
#Dropping rows of duplicate text documents
df = df.drop_duplicates(subset="Document")
df.head()

Unnamed: 0,Document,Label
0,"Founded in 2004, Zecotek operates three divisi...",Human capital
1,Pakistan has a long tradition of international...,Human capital
2,Berners-Lee worked at the European Organizati...,Human capital
3,"In others, it is impossible to remember life b...",Human capital
4,Doha \nTWO Qatari electrical engineering stud...,Human capital


<b>What operations do you need to perform on the data you have to clean it up?</b>

In [21]:
#Making a new list of text documents with extra whitespace and asterisks and quotation marks 
#that would complicate further cleansing removed
cleaner_documents = [text.replace("*", " ").replace('"','') for text in list(df['Document'])]
clean_documents = [re.sub('[\s+]', ' ',text) for text in cleaner_documents]

#Adding the cleansed documents to the dataframe
df['Document Clean'] = clean_documents

In [22]:
#Seeing what the cleansed documents look like
for clean_document in clean_documents[:5]:
    print(clean_document+"\n")

Founded in 2004, Zecotek operates three divisions: Imaging Systems, Optronics Systems and 3D Display Systems with labs located in Canada, Korea, Russia, Singapore and U.S.A. The management team is focused on building shareholder value by commercializing over 50 patented and patent pending novel photonic technologies directly and through strategic alliances with Hamamatsu Photonics (Japan), the European Organization for Nuclear Research (Switzerland), Shanghai EBO Optoelectronics Technology Co. (China), NuCare Medical Systems (South Korea), the University of Washington (United States), and National NanoFab Center (South Korea). For more information visit www.zecotek.com and follow @zecotek on Twitter.

Pakistan has a long tradition of international scientific collaborations. In addition to being actively involved in IAEA's activities, for decades Pakistan has been contributing and regularly participating in European Organization for Nuclear Research's projects, theoretical and nuclear e

For language to be understood by computers, it needs to be turned into a numerical form. That begins with separating the words in a document into units called tokens. Here, the example uses simple tokens of singular words which are lemmatized: For the words the "base" form is found out with a statistical approach.

<b>There are different ways to do tokenization (e.g., n-grams) and lemmatization (e.g., stemming or doing nothing). Quickly search for the benefits of the different approaches and decide if you want to choose "True" or "False" for the lemmatization variable below.</b>

In [23]:
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
     ---------------------------------------- 0.0/12.8 MB ? eta -:--:--
     ---------------------------------------- 0.0/12.8 MB ? eta -:--:--
     --------------------------------------- 0.0/12.8 MB 330.3 kB/s eta 0:00:39
      --------------------------------------- 0.2/12.8 MB 1.5 MB/s eta 0:00:09
     - -------------------------------------- 0.6/12.8 MB 3.7 MB/s eta 0:00:04
     --- ------------------------------------ 1.1/12.8 MB 5.2 MB/s eta 0:00:03
     ----- ---------------------------------- 1.7/12.8 MB 6.8 MB/s eta 0:00:02
     ------- -------------------------------- 2.4/12.8 MB 8.1 MB/s eta 0:00:02
     --------- ------------------------------ 3.0/12.8 MB 8.7 MB/s eta 0:00:02
     ---------- ----------------------------- 3.4/12.8 MB 8.7 MB/s eta 0:00:02
     ------------- -------------------------- 4

In [24]:
#Lemmatization:
lemmatized = True

#Loading a python library for natural language processing
nlp = spacy.load(spacy_model, disable=['merge_noun_chunks'])

#Creating a function that will dothe basic tokenization of the documents
def basic_tokenizer(document, lemmatized=lemmatized):
    #Converting the text document into a Spacy document
    document = nlp(document)
    if not lemmatized:
        tokenized = [token.text.lower() for token in document if token.ent_iob == 2 #<- This removes known entities
                     and not (token.is_stop or token.is_punct or token.is_space or token.like_num 
                              or token.like_url or token.like_email)]
    if lemmatized:
        tokenized = [token.lemma_.lower() for token in document if token.ent_iob == 2 #<- This removes known entities
                     and not (token.is_stop or token.is_punct or token.is_space or token.like_num 
                              or token.like_url or token.like_email)]       
    #Returns a list of tokens
    return tokenized

#Initializing a list where to add the treated documents
tokenized_documents = []

for document in clean_documents:
    #using basic tokenizer on the document with Spacy's chunks disabled
    tokenized_documents.append(basic_tokenizer(document))

#Adding the tokenized documents to the dataframe
df['Tokenized'] = tokenized_documents

#printing a few examples of what the treated documents look like now
for i in range(5):
    print("Cleansed: "+clean_documents[i]+"\n")
    print("Tokenized: "+str(tokenized_documents[i])+"\n")

Cleansed: Founded in 2004, Zecotek operates three divisions: Imaging Systems, Optronics Systems and 3D Display Systems with labs located in Canada, Korea, Russia, Singapore and U.S.A. The management team is focused on building shareholder value by commercializing over 50 patented and patent pending novel photonic technologies directly and through strategic alliances with Hamamatsu Photonics (Japan), the European Organization for Nuclear Research (Switzerland), Shanghai EBO Optoelectronics Technology Co. (China), NuCare Medical Systems (South Korea), the University of Washington (United States), and National NanoFab Center (South Korea). For more information visit www.zecotek.com and follow @zecotek on Twitter.

Tokenized: ['found', 'operate', 'division', 'lab', 'locate', 'management', 'team', 'focus', 'build', 'shareholder', 'value', 'commercialize', 'patented', 'patent', 'pende', 'novel', 'photonic', 'technology', 'directly', 'strategic', 'alliance', 'information', 'visit', 'follow', 

After the documents have been tokenized and treated as still words, it's relevant to turn the text into numeric form that computer algorithms can understand. Two common ways are TFIDF and BOW vectorizations. Acquaint yourself with both, and choose which one you want to use.

Do you want to use __[TFIDF](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#)__ or __[Bag-of-Words](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)__ vectorization?

<b>Explain why you decided as you did</b>

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

#change this to False if you want to use Baf-of-words
tf_idf_vectorization = True

#the dummy function that returns the already tokenized document
def id_fun(already_tokenized):
    return already_tokenized

#initializing tf-idf
if tf_idf_vectorization:
    vectorizer = TfidfVectorizer(
        analyzer='word',
        tokenizer=id_fun,
        preprocessor=id_fun,
        token_pattern=None)
    
#initializing bag-of-words
else:
    vectorizer = CountVectorizer(
        analyzer='word',
        tokenizer=id_fun,
        preprocessor=id_fun,
        token_pattern=None)
    
#implementing the vectorization
vectorized = vectorizer.fit_transform(tokenized_documents)
#tweaking the form of the data for analysis
dense = vectorized.todense()

In [26]:
'''
Making a simple function that will name all the columns 
in the dataset of the vectorized documents
'''

def name_x(dense, doc):
    shape = dense.shape
    # Generate column names with running numeration
    column_names = [f'x_{i+1}' for i in range(shape[1])]

    data = dense

    df = pd.DataFrame(data, columns=column_names)
    df['Clean'] = doc
    
    return df

In [27]:
'''
Defining which part of the data is the data and which one is the label,
using the function defined just previously
'''

X = name_x(dense, df['Document Clean'])
y =  df['Label']

#printing out a small example of what the data looks like
X.head()

Unnamed: 0,x_1,x_2,x_3,x_4,x_5,x_6,x_7,x_8,x_9,x_10,...,x_4092,x_4093,x_4094,x_4095,x_4096,x_4097,x_4098,x_4099,x_4100,Clean
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"Founded in 2004, Zecotek operates three divisi..."
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Pakistan has a long tradition of international...
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Berners-Lee worked at the European Organizati...
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"In others, it is impossible to remember life b..."
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Doha TWO Qatari electrical engineering stude...


Then we split the found data to training and testing data. 
Sometimes we also use validation data. 

<b>What are common splits in percentages and what would be applicable here compared to the quantitative approach?</b>

In [28]:
X_Train, X_Test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

In [29]:
#Removing the non-numerical column from the data before passing this on to the ML algorithms
X_train = X_Train.drop(['Clean'], axis =1)
X_test = X_Test.drop(['Clean'], axis =1)

In [30]:
#Move this block to the location in the code from where you want to start tracking
start_time = time.time()

Now determine which ML algorithm you want to try from the ones you studied briefly earlier.

In [31]:
#Choose the Ml algorithm you want to try:
    #'NN' - Neural Network
    #'KNN' - Nearest Neighbours
    #'SVM' - Support Vector Machine
    #'Tree' - Decision Tree
name = "NN"
clf = choose_algorithm(name).fit(X_train, y_train)

In [32]:
'''
Move this block to where you want to stop the
tracking of used time and memory of the code.
'''

#start_time = time.time()
# Call the function to get memory usage and elapsed time
memory_used, time_elapsed = track_memory_and_time(start_time)

print(f"Memory Used: {memory_used} bytes")
print(f"Time Elapsed: {time_elapsed} seconds")

Memory Used: 10376433664 bytes
Time Elapsed: 76.3227870464325 seconds


The block below prints out the accuracy of the ML algorithm on the test set.
If used algorithm is the decision tree, it also prints out the visualization of the tree.

<b>How could you check the accuracy? Think of a strategy depending on the used algorithm.</b>

Consider cases with correct labels in the training data and cases with unclear cases in the training data.

<b>How does this differ from the quantitative case?</b>

In [33]:
print("The accuracy of the algorithm is: ")
print(accuracy(clf, X_test, y_test))
if name == "Tree":
    tree.plot_tree(clf)

The accuracy of the algorithm is: 
81.19122257053291%


In [34]:
#Creating a function that will identify when the algorithm diverged from the original label in the dataset
def find_differences(clf, X_test, y_test):
    
    predictions = []
    test_labels = []
    column_names = list(X_test.columns.values)
    events = pd.DataFrame(columns=list(X_test.columns.values))
    
    for i in range(len(list(y_test))-1):
        predict = [X_test.drop(columns="Clean", axis=1).values[i+1]]
        prediction = str(clf.predict(predict)[0])
        if str(clf.predict(predict)[0]) != list(y_test)[i+1]:
            predictions.append(prediction)
            test_labels.append(list(y_test)[i+1])
            events.loc[len(events)] = X_test.values[i+1]
            
    df = pd.DataFrame(columns=['Prediction', 'Original label'])
    df['Prediction']= predictions
    df['Original label']= test_labels

    df = pd.concat([df, events], axis=1)
    return df

In [35]:
#Creating a function that will find the documents that the algorithm diverged on
def get_original_doc(differences):
    df = differences(['Prediction', 'Original label', 'Clean'])
    return df

In [36]:
print("Instances in which the algorithm prediction diverged from the original label were: ")
df = find_differences(clf, X_Test, y_test)[['Prediction', 'Original label', 'Clean']]
pd.set_option('display.max_colwidth', 0)
display(df)

Instances in which the algorithm prediction diverged from the original label were: 


Unnamed: 0,Prediction,Original label,Clean
0,Technology,Scientific knowledge,"Body CERN creates record particle collision On March 30, the European Organization for Nuclear Research made a huge leap in the field of quantum physics. Using the Large Hadron Collider particle accelerator, scientists achieved energies from the collision of two protons at speeds nearly 3.5 times higher than previously recorded, making it the largest ever release of energy by a particle accelerator."
1,Human capital,Technology,"Body (ANSA) - November 6 - CERN said Wednesday that its council has appointed Italian physicist Fabiola Gianotti for a second mandate as the Director-General of the European Organization for Nuclear Research, CERN. The 59-year-old Rome native, who participated in the discovery of the Higgs boson, became the first woman to head the lab when she started her first term in 2016."
2,Technology,Human capital,"The feeling was mutual. Johan said he was going to CERN for a year,Chavez said, using the French acronym for the European Organization for Nuclear Research, home of the world's most powerful particle collider. And I said, 'I'm going to be doing an installation there"
3,Human capital,Technology,"But someday soon, conventional X-rays may be replaced by a device that can produce highly detailed 3D color pictures of what`s going on inside the human body. The new device, the MARS x-ray scanner, uses Medipix imaging technology originally developed at CERN, the European Organization for Nuclear Research. As this CERN media release explains, Medipix actually detects and counts each individual particle hitting the pixels when its electronic shutter is open, enabling the creation of high-contrast, super-accurate images."
4,Technology,Human capital,"Stefano Buono, 52, became a director in May 2018 and serves on the Nominating and Corporate Governance Committee. Mr. Buono is an accomplished Italian physicist and alumnus of The European Organization for Nuclear Research. Until January 2018, Mr. Buono was the Chief Executive Officer and President of Advanced Accelerator Applications (AAA), an international radiopharmaceutical company he founded in 2002."
5,Scientific knowledge,Technology,"A giant magnet in Europe will not destroy the planet Before the European Organization for Nuclear Research fired up the Large Hadron Collider in 2008, critics worried that smashing together protons in a 17-mile ring underground would create a black hole that would swallow the earth. Scientists had smaller ones in mind."
6,Human capital,Scientific knowledge,All Rights Reserved Section: International; Foreign Organizations Length: 532 words Byline: CERN - European Organization for Nuclear Research
7,Technology,Scientific knowledge,"Those questions remain unanswered. In other news FoxNews reported, There are some indications that physicists working at the LHC accelerator at the European Organization for Nuclear Research (CERN) near Geneva may see the first traces of physics beyond the current theory which describes the structure of matter, said the IFJ PAN. More data are needed before we can tell anything significant on this, so we will have to wait for the LHC to restart (soon), he explained via email, noting the importance of patience when recording and analyzing data."
8,Technology,Scientific knowledge,"Body Russia withdrew its application for membership in the European Organization for Nuclear Research (CERN). Our country takes an active part in the work of this scientific center, hundreds of Russian scientists are conducting research there, our enterprises make unique equipment tens of millions of dollars worth, including for the famous Large Hadron Collider."
9,Human capital,Scientific knowledge,"Since August 2015, Hossein Ali Khosroabadi has been working as SESAME's Beamline Optics Scientist. For 7 months, Iran seconded Ehsan Yousefi to CERN (European Organization for Nuclear Research) to help in construction of the magnetic system for SESAME's storage ring - construction of the magnets has been funded by the European Commission and led by CERN, in collaboration with SESAME. An Iranian, Fatemeh Elmi of the University of Mazandaran, was one of the first scientists to use the Fourier Transform Infrared microscope installed at SESAME in the summer of 2014 to jump-start research -"


<b>How do you know if this is to be agreed with or not? How is this different from the quantitative case? How would you go about improving the process and how would you get the resourches for this?</b>

Do the following reflection and calculations for either the qualitative or the quantitative task:

Read on the __[energy use and co2](https://medium.com/stanford-magazine/carbon-and-the-cloud-d6f481b79dfe)__ effects of different types of AI. Briefly summarized, algorithms ran on a personal hard disk, which requires about 0.000005 kWh per gigabyte to save your data, whereas the combination of transmitting your data and storing it in a data center probably requires about 3 to 7 kWh per gigabyte. Moreover, storing 100 gigabytes of data in the cloud during a year releases 0.2 tons of CO2.

Are you currently running your algorithm in the cloud of on your personal device? Based on the memory use trackers you have used in the code, calculate how much more or less (in %) doing the opposite would do.

Who bears the costs of the pollution? Estimate the costs of pollution and consider a scenario where the data storer and user would bear the costs relevant to this. 

In addition, consider the rare minerals and their associated pollution required for an average server farm.