## Below type in the specifications for how you want to run the example with CERN news pieces

For language to be understood by computers, it needs to be turned into a numerical form. That begins with separating the words in a document into units called tokens. Here, the example uses simple tokens of singular words which are lemmatized: For the words the "base" form is found out with a statistical approach.

<b>There are different ways to do tokenization (e.g., n-grams) and lemmatization (e.g., stemming or doing nothing). Quickly search for the benefits of the different approaches and decide what you want to choose for the few of such selected options for parameters below.</b>

In [1]:
'''
Choose below "NO" for no stopword removal, "YES" for stopword removal
'''
chosen_stopword_removal = "YES"

'''
Choose below "YES" to lemmatize documents, else choose "NO"
'''
chosen_lemmatization = "YES"

'''
Choose below "BOW" or "TFIDF" for vectorization
'''
chosen_vectorization = "TFIDF"

Below are some different types of machine learning algorithms. Acquaint yourself with the basic pros and cons of each. If you wish to, you can also find other types of ML to learn about at scikit-learn.org.

__[Neural Network](https://scikit-learn.org/stable/modules/neural_networks_supervised.html)__

__[Support Vector Machine](https://scikit-learn.org/stable/modules/svm.html#svm)__

__[Nearest Neighbours](https://scikit-learn.org/stable/modules/neighbors.html)__

__[Decision Tree](https://scikit-learn.org/stable/modules/tree.html#tree)__

Some specifics are required to the algorithms such as how many iterations to use, how to scale data, etc. If you want to get acquainted with the different such options, use the links above to learn about the parameters and their effects.

However, <b>consider: Who would know which parameters to use and why if you were to implement AI in your organization?</b>

In [2]:
'''
Choose below
"KNN" for nearest neighbours
"NN" for neural network
"SVM" for support vector machine
"Tree" for decision-tree
'''
chosen_algorithm = "NN"

Here we import the basic libraries required for data analysis with Python!

In [3]:
#Here we import the basic Python data analytics libraries
import warnings
warnings.filterwarnings("ignore")

import pandas as pd   # includes tools used in reading data
import numpy as np   # includes tools for numerical calculus
import matplotlib.pyplot as plt  # includes tools used in plotting data

Below we define a function that we can use to track how much time and memory the code uses between a certain part of running the code.

In [4]:
import psutil
import time

def track_memory_and_time(start_time=None):
    """
    Track memory usage and elapsed time since a specific point in the code.

    Parameters:
    - start_time: Optional parameter. If provided, it should be the result of a previous call to time.time().

    Returns:
    - memory_usage: Current memory usage in bytes.
    - elapsed_time: Elapsed time in seconds since the specified start time.
    """

    # Get memory usage
    memory_usage = psutil.virtual_memory().used

    # Get elapsed time
    current_time = time.time()
    elapsed_time = current_time - start_time if start_time else 0

    return memory_usage, elapsed_time

In [5]:
#Defining and creating a function with which you can later on get the accuracy of the ML/AI algorithm
def accuracy(clf, X_test, y_test):
    score = str(clf.score(X_test, y_test)*100)+'%'

    return score

In [6]:
#this is just defining and creating a function with which you can later on compare the AI predictions to the original data
def find_differences(clf, X_test, y_test):
    
    predictions = []
    test_labels = []
    column_names = list(X_test.columns.values)
    events = pd.DataFrame(columns=list(X_test.columns.values))
    
    for i in range(len(list(y_test))-1):
        if str(clf.predict([X_test.values[i+1]])[0]) != list(y_test)[i+1]:
            predictions.append(clf.predict([X_test.values[i+1]])[0])
            test_labels.append(list(y_test)[i+1])
            events.loc[len(events)] = X_test.values[i+1]
               
    df = pd.DataFrame(columns=['Prediction', 'Original label'])
    df['Prediction']= predictions
    df['Original label']= test_labels

    df = pd.concat([df, events], axis=1)
    return df

In [7]:
#Move this block to the location in the code from where you want to start tracking
start_time = time.time()

In [8]:
#Here we import the different ML algorithms from the scikit-learn library

from sklearn.neural_network import MLPClassifier
#Neural Network

from sklearn import svm
#Support Vector Machine

from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier 
#Nearest Neighbours

from sklearn import tree
#Decision tree

In [9]:
def choose_algorithm(name):
    if name == "NN":
        clf = MLPClassifier(random_state=1, max_iter=300).fit(X_train, y_train)
    elif name == "KNN":
        from sklearn.pipeline import Pipeline
        from sklearn.preprocessing import StandardScaler
        clf = Pipeline(steps=[("scaler", StandardScaler()), ("knn", KNeighborsClassifier(n_neighbors=11))])
    elif name == "SVM":
        clf = svm.SVC(kernel='rbf')#'linear' or 'rbf'
    else:
        clf = tree.DecisionTreeClassifier(max_depth=12)
        
    return clf

## Qualitative

For supervised machine learning, you need a dataset and correct labels to give to it.

<b>For your work, think of an AI task using qualitative data, e.g., text of images, that would be useful for you and determine the type of data you would need to teach to an AI: Where or how can you get 10k+ examples with the correct "labels" assigned to the datapoints - what resources would you need to create/retrieve this data?</b>

The example here uses text pieces.

In [10]:
#! conda install -c conda-forge spacy

In [11]:
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ---------------------------------------- 0.0/12.8 MB ? eta -:--:--
     ---------------------------------------- 0.0/12.8 MB ? eta -:--:--
     --------------------------------------- 0.0/12.8 MB 435.7 kB/s eta 0:00:30
     --------------------------------------- 0.1/12.8 MB 762.6 kB/s eta 0:00:17
     - -------------------------------------- 0.5/12.8 MB 3.6 MB/s eta 0:00:04
     ---- ----------------------------------- 1.5/12.8 MB 8.1 MB/s eta 0:00:02
     -------- ------------------------------- 2.7/12.8 MB 11.4 MB/s eta 0:00:01
     ------------ --------------------------- 3.9/12.8 MB 13.8 MB/s eta 0:00:01
     --------------- ------------------------ 4.8/12.8 MB 14.6 MB/s eta 0:00:01
     ---------------- ----------------------- 5.2/12.8 MB 13.8 MB/s eta 0:00:01
     ------------------- -----------------

In [12]:
'''
Importing the Python libraries to analyze and treat text data.
'''
import re
import spacy
spacy_model = "en_core_web_sm"

Opening up the file with the data, this is the example, if you have different data in mind, you can use that...

In [13]:
df = pd.read_excel('cern_news_data.xlsx')
#Choosing the columns we want for our analysis
df = df[['Document', 'Label']]
#Dropping rows with empty data
df = df.dropna(how = 'any',axis = 0).reset_index(drop = True)
#Dropping rows of duplicate text documents
df = df.drop_duplicates(subset="Document")
df = df.sample(frac = 1).reset_index(drop=True)
df.head()

Unnamed: 0,Document,Label
0,Since starting at KU she has worked with Disti...,Human capital
1,Inc. All Rights Reserved\n\n\nLength: 260 word...,Human capital
2,"Founded in 2004, Zecotek operates three divisi...",Human capital
3,Winning Company Details: The European Organiza...,Scientific knowledge
4,Page of \nCERN exhibition due to begin in Kuw...,Human capital


<b>What operations do you need to perform on the data you have to clean it up?</b>

In [14]:
#Making a new list of text documents with extra whitespace and asterisks and quotation marks 
#that would complicate further cleansing removed
cleaner_documents = [text.replace("*", " ").replace('"','') for text in list(df['Document'])]
clean_documents = [re.sub('[\s+]', ' ',text) for text in cleaner_documents]

#Adding the cleansed documents to the dataframe
df['Document Clean'] = clean_documents

In [15]:
#Seeing what the cleansed documents look like
for clean_document in clean_documents[:5]:
    print(clean_document+"\n")

Since starting at KU she has worked with Distinguished Professor of Physics and Astronomy Alice Bean and Professor of Physics and Astronomy Phil Baringer in the Particle Physics Laboratory. With them, she has been involved in research at CERN (The European Organization for Nuclear Research) in Geneva, Switzerland. Her current research focuses on the search for the theoretical top-prime particle in relation to the recently-discovered Higgs boson.

Inc. All Rights Reserved   Length: 260 words Body   Limelight Networks reported that the European Organization for Nuclear Research (CERN) is using Limelight's Content Delivery Network (CDN) to live stream scientific discoveries to thousands of physicists and engineers worldwide. CERN, a physics laboratory, regularly shares knowledge with scientists and the general public around the world.

Founded in 2004, Zecotek operates three divisions: Imaging Systems, Optronics Systems and 3D Display Systems with labs located in Canada, Korea, Russia, Si

In [16]:
#Lemmatization:
lemmatized = True
if chosen_lemmatization == "NO":
    lemmatized = False

#Loading a python library for natural language processing
nlp = spacy.load(spacy_model, disable=['merge_noun_chunks'])

#Creating a function that will dothe basic tokenization of the documents
def basic_tokenizer(document, lemmatized=lemmatized):
    #Converting the text document into a Spacy document
    document = nlp(document)
    if chosen_stopword_removal == "YES":
        if not lemmatized:
            tokenized = [token.text.lower() for token in document if token.ent_iob == 2 #<- This removes known entities
                         and not (token.is_stop or token.is_punct or token.is_space or token.like_num 
                                  or token.like_url or token.like_email)]
        if lemmatized:
            tokenized = [token.lemma_.lower() for token in document if token.ent_iob == 2 #<- This removes known entities
                         and not (token.is_stop or token.is_punct or token.is_space or token.like_num 
                                  or token.like_url or token.like_email)]
    else:
        if not lemmatized:
            tokenized = [token.text.lower() for token in document if token.ent_iob == 2 #<- This removes known entities
                         and not (token.is_punct or token.is_space or token.like_num 
                                  or token.like_url or token.like_email)]
        if lemmatized:
            tokenized = [token.lemma_.lower() for token in document if token.ent_iob == 2 #<- This removes known entities
                         and not (token.is_punct or token.is_space or token.like_num 
                                  or token.like_url or token.like_email)]        
    #Returns a list of tokens
    return tokenized

#Initializing a list where to add the treated documents
tokenized_documents = []

for document in clean_documents:
    #using basic tokenizer on the document with Spacy's chunks disabled
    tokenized_documents.append(basic_tokenizer(document))

#Adding the tokenized documents to the dataframe
df['Tokenized'] = tokenized_documents

#printing a few examples of what the treated documents look like now
for i in range(5):
    print("Cleansed: "+clean_documents[i]+"\n")
    print("Tokenized: "+str(tokenized_documents[i])+"\n")

Cleansed: Since starting at KU she has worked with Distinguished Professor of Physics and Astronomy Alice Bean and Professor of Physics and Astronomy Phil Baringer in the Particle Physics Laboratory. With them, she has been involved in research at CERN (The European Organization for Nuclear Research) in Geneva, Switzerland. Her current research focuses on the search for the theoretical top-prime particle in relation to the recently-discovered Higgs boson.

Tokenized: ['start', 'ku', 'work', 'professor', 'physics', 'involve', 'research', 'current', 'research', 'focus', 'search', 'theoretical', 'prime', 'particle', 'relation', 'recently', 'discover']

Cleansed: Inc. All Rights Reserved   Length: 260 words Body   Limelight Networks reported that the European Organization for Nuclear Research (CERN) is using Limelight's Content Delivery Network (CDN) to live stream scientific discoveries to thousands of physicists and engineers worldwide. CERN, a physics laboratory, regularly shares knowle

After the documents have been tokenized and treated as still words, it's relevant to turn the text into numeric form that computer algorithms can understand. Two common ways are TFIDF and BOW vectorizations. Acquaint yourself with both, and choose which one you want to use.

Do you want to use __[TFIDF](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#)__ or __[Bag-of-Words](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)__ vectorization?

<b>Explain why you decided as you did</b>

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

#the dummy function that returns the already tokenized document
def id_fun(already_tokenized):
    return already_tokenized

#initializing tf-idf
if chosen_vectorization == "TFIDF":
    vectorizer = TfidfVectorizer(
        analyzer='word',
        tokenizer=id_fun,
        preprocessor=id_fun,
        token_pattern=None)
    
#initializing bag-of-words
else:
    vectorizer = CountVectorizer(
        analyzer='word',
        tokenizer=id_fun,
        preprocessor=id_fun,
        token_pattern=None)
    
#implementing the vectorization
vectorized = vectorizer.fit_transform(tokenized_documents)
#tweaking the form of the data for analysis
dense = vectorized.todense()

In [18]:
'''
Making a simple function that will name all the columns 
in the dataset of the vectorized documents
'''

def name_x(dense, doc):
    shape = dense.shape
    # Generate column names with running numeration
    column_names = [f'x_{i+1}' for i in range(shape[1])]

    data = dense

    df = pd.DataFrame(data, columns=column_names)
    df['Clean'] = doc
    
    return df

In [19]:
'''
Defining which part of the data is the data and which one is the label,
using the function defined just previously
'''

X = name_x(dense, df['Document Clean'])
y =  df['Label']

#printing out a small example of what the data looks like
X.head()

Unnamed: 0,x_1,x_2,x_3,x_4,x_5,x_6,x_7,x_8,x_9,x_10,...,x_4092,x_4093,x_4094,x_4095,x_4096,x_4097,x_4098,x_4099,x_4100,Clean
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Since starting at KU she has worked with Disti...
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Inc. All Rights Reserved Length: 260 words B...
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"Founded in 2004, Zecotek operates three divisi..."
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Winning Company Details: The European Organiza...
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Page of CERN exhibition due to begin in Kuwa...


<b>What are the cleaning operations performed above on the CERN datasets? Try to search for what they do and think about why they should/should not be performed.</b>

Then we split the found data to training and testing data. 
Sometimes we also use validation data. 

<b>What are common splits in percentages</b>

In [20]:
from sklearn.model_selection import train_test_split
X_Train, X_Test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

In [21]:
#Removing the non-numerical column from the data before passing this on to the ML algorithms
X_train = X_Train.drop(['Clean'], axis =1)
X_test = X_Test.drop(['Clean'], axis =1)

In [22]:
if chosen_algorithm == "NN":
    from sklearn.preprocessing import StandardScaler  
    scaler = StandardScaler()  
    # Don't cheat - fit only on training data
    scaler.fit(X_train)  
    X_train = scaler.transform(X_train)  
    # apply same transformation to test data
    X_test = scaler.transform(X_test) 

In [23]:
#Move this block to the location in the code from where you want to start tracking
start_time = time.time()

Now choosing and fitting the ML algorithm you chose earlier.

In [24]:
name = chosen_algorithm
clf = choose_algorithm(name).fit(X_train, y_train)

In [25]:
'''
Move this block to where you want to stop the
tracking of used time and memory of the code.
'''

#start_time = time.time()
# Call the function to get memory usage and elapsed time
memory_used, time_elapsed = track_memory_and_time(start_time)

print(f"Memory Used: {memory_used} bytes")
print(f"Time Elapsed: {time_elapsed} seconds")

Memory Used: 12570529792 bytes
Time Elapsed: 23.910399436950684 seconds


The block below prints out the accuracy of the ML algorithm on the test set.
If used algorithm is the decision tree, it also prints out the visualization of the tree.

<b>How could you check the accuracy? Think of a strategy depending on the used algorithm.</b>

Consider cases with correct labels in the training data and cases with unclear cases in the training data.

In [26]:
print("The accuracy of the algorithm is: ")
print(accuracy(clf, X_test, y_test))
if chosen_algorithm == "Tree":
    tree.plot_tree(clf)

The accuracy of the algorithm is: 
75.23510971786834%


In [27]:
#Creating a function that will identify when the algorithm diverged from the original label in the dataset
def find_differences(clf, X_test, y_test):
    
    predictions = []
    test_labels = []
    column_names = list(X_test.columns.values)
    events = pd.DataFrame(columns=list(X_test.columns.values))
    
    for i in range(len(list(y_test))-1):
        predict = [X_test.drop(columns="Clean", axis=1).values[i+1]]
        prediction = str(clf.predict(predict)[0])
        if str(clf.predict(predict)[0]) != list(y_test)[i+1]:
            predictions.append(prediction)
            test_labels.append(list(y_test)[i+1])
            events.loc[len(events)] = X_test.values[i+1]
            
    df = pd.DataFrame(columns=['Prediction', 'Original label'])
    df['Prediction']= predictions
    df['Original label']= test_labels

    df = pd.concat([df, events], axis=1)
    return df

In [28]:
#Creating a function that will find the documents that the algorithm diverged on
def get_original_doc(differences):
    df = differences(['Prediction', 'Original label', 'Clean'])
    return df

In [29]:
print("Instances in which the algorithm prediction diverged from the original label were: ")
df = find_differences(clf, X_Test, y_test)[['Prediction', 'Original label', 'Clean']]
pd.set_option('display.max_colwidth', 0)
display(df)

Instances in which the algorithm prediction diverged from the original label were: 


Unnamed: 0,Prediction,Original label,Clean
0,Human capital,Scientific knowledge,All Rights Reserved Section: International; Foreign Organizations Length: 532 words Byline: CERN - European Organization for Nuclear Research
1,Human capital,Scientific knowledge,All Rights Reserved Length: 294 words Body European Organization for Nuclear Research (CERN) has awarded a contract for Research and Development Services and Related Consultancy Services at Switzerland-Geneva. The contract was awarded to ELYTT Energy.
2,Scientific knowledge,Technology,"Body The CERN research board has approved the Forward Search Experiment, giving a green light to the assembly, installation and use of an instrument that will look for new fundamental particles at the Large Hadron Collider in Geneva, Switzerland. Initiated by physicists at the University of California, Irvine, the five-year FASER project is funded by grants of $1 million each from the Heising-Simons Foundation and the Simons Foundation - with additional support from CERN, the European Organization for Nuclear Research. FASER's focus is to find light, extremely weakly interacting particles that have so far eluded scientists, even in the high-energy experiments conducted at the CERN-operated LHC, the largest particle accelerator in the world."
3,Technology,Human capital,"Body (ANSA) - November 6 - CERN said Wednesday that its council has appointed Italian physicist Fabiola Gianotti for a second mandate as the Director-General of the European Organization for Nuclear Research. The 59-year-old Rome native, who participated in the discovery of the Higgs boson, became the first woman to head the lab when she started her first term in 2016."
4,Scientific knowledge,Human capital,"The history of Web sites Over the past three months, I've been providing tips on how to construct and market a Web site but it wasn't always this easy. The ability to construct and promote a site has come a long way since 1989, when the World Wide Web was created by Tim Bemers-Lee, an English computer scientist for the European Organization for Nuclear Research (CERN). Four years later, on April 30, CERN announced that the World Wide Web would be free to use for anyone."
...,...,...,...
91,Scientific knowledge,Human capital,"On Aug. 28, an international team of thousands of researchers -- including Iashvili, Kharchilava and Rappoccio -- announced that they had observed the Higgs boson, a subatomic particle, decaying into a pair of lighter particles called a bottom quark and antibottom quark. The sighting took place at the world's most powerful particle accelerator, the Large Hadron Collider (LHC) at the European Organization for Nuclear Research (CERN). The finding deepens our understanding of why objects have mass."
92,Human capital,Scientific knowledge,"Length: 348 words Body Paris: United Nations Educational, Scientific and Cultural Organization has issued the following news release: The European Organization for Nuclear Research (CERN) and the United Nations Educational, Scientific and Cultural Organization (UNESCO) will co-organize the UNESCO-CERN School on Digital Libraries 2018 which will be held in Nairobi, Kenya, from 8 to 12 October 2018 and will be hosted by Nairobi University."
93,Scientific knowledge,Human capital,"Chapter eight returns to profiles, this time in the form of nine interviews of contemporary individuals. Each provides intriguing insights into curious personalities, such as physicist Freeman Dyson and his preference for details over the big picture; a fascination with multiple subjects on the part of Fabiola Gianotti, director-general of the European Organization for Nuclear Research (CERN); Brian May, guitarist of the rock band Queen, who later pursued an astrophysics career; and a belief that childhood represented a time of even greater, unrestrained curiosity, typified by Gianotti and autodidact Marilyn vos Savant. Livio recalls the many times societies have sought to limit curiosity."
94,Scientific knowledge,Human capital,"Cornell Chronicle University has issued the following news release: Particle accelerators such as the Large Hadron Collider (LHC) at the European Organization for Nuclear Research (CERN) produce massive amounts of data that help answer long-held questions regarding Earth and the far reaches of the universe. The Higgs boson, which had been the missing link in the Standard Model of Particle Physics, was discovered there in 2012 and earned researchers the 2013 Nobel Prize in physics."


<b>How do you know if this is to be agreed with or not? How is this different from a case with quantitative data? How would you go about improving the process and how would you get the resourches for this?</b>

Read on the __[energy use and co2](https://medium.com/stanford-magazine/carbon-and-the-cloud-d6f481b79dfe)__ effects of different types of AI. 

Briefly summarized, algorithms ran on a personal hard disk, require about 0.000005 kWh per gigabyte to save your data, whereas the combination of transmitting your data and storing it in a data center probably requires about 3 to 7 kWh per gigabyte. Moreover, storing 100 gigabytes of data in the cloud during a year releases 0.2 tons of CO2.

Are you currently running your algorithm in the cloud of on your personal device? Based on the memory use trackers you have used in the code, calculate how much more or less (in %) doing the opposite would do.

Who bears the costs of the pollution? Estimate the costs of pollution and consider a scenario where the data storer and user would bear the costs relevant to this. 

In addition, consider the rare minerals and their associated pollution required for an average server farm.