# Keyword Analysis on Fedora Mailing List using Watson NLU

This notebook demonstrates how to analyze the Fedora mailing list with Watson Natural Language Understanding (NLU), using its default models.

This is based on the Watson Discovery Tutorial at https://github.com/spackows/CASCON-2019_NLP-workshops/blob/master/notebooks/Notebook-1_Exploring-NLU.ipynb

In [137]:
import json
import pandas as pd
import numpy as np
import glob
import itertools
import re
import os
import gc
from src import utils
import datetime
from collections import Counter
from collections import defaultdict
from pathlib import Path

from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_watson.natural_language_understanding_v1 import Features, ConceptsOptions, EmotionOptions, EntitiesOptions, KeywordsOptions, SemanticRolesOptions, SentimentOptions, CategoriesOptions, SyntaxOptions, SyntaxOptionsTokens

In [115]:
BASE_PATH = os.getenv("LOCAL_DATA_PATH", "../../data")

## Step 1: Look up Natural Language Understanding API key and URL

1. From the **Navigation menu** ( <img style="margin: 0px; padding: 0px; display: inline;" src="https://github.com/spackows/CASCON-2019_NLP-workshops/raw/master/images/nav-menu-icon.png"/> ), under the **Services** group, right-click "Watson Services" and then open the link in a new browser tab
2. In the new Watson services tab, from the **Action** menu beside your Natural Language Understanding instance, select "Manage in IBM Cloud"
3. In the service details page that opens, copy the apikey and URL

In [2]:
apikey = '' #<-- PASTE YOUR APIKEY HERE
url    = '' #<-- PASTE YOUR SERVICE URL HERE

The NLU API can be used to extract:
- Sentiment
- Emotion
- Keywords
- Entities
- Categories
- Concepts
- Syntax
- Semantics

In [117]:
# Instantiate a natural language understanding object
authenticator = IAMAuthenticator( apikey )
nlu = NaturalLanguageUnderstandingV1( version='2018-11-16', authenticator=authenticator )
nlu.set_service_url( url )

## Step 2: Import Fedora Email List

In [118]:
df = utils.load_dataset(f"{BASE_PATH}/interim/text/")
df.head()

Unnamed: 0,Message-ID,Date,Body
0,<681b9b44-e961-1fa8-3708-1ff5b76dff3d@arm.com>,"Wed, 31 Jan 2018 19:34:24 -0600","['On 01/31/2018 09:49 AM, J. Bruce Fields wrot..."
1,<20180201014242.GB17460@jelly>,"Thu, 01 Feb 2018 11:42:42 +1000","['On Wed, Jan 31, 2018 at 10:18:14PM +0000, To..."
2,<CABB28CzbBUA6WCs5t9ee4g=CecCJ+hOm8tey4W7RsyRG...,"Thu, 01 Feb 2018 02:13:41 +0000","['On 1 February 2018 at 01:42, Peter Hutterer ..."
3,<CABB28Cz13c0Z_y2j54RDMoOXdLAU19pxE1++xZWfVX76...,"Thu, 01 Feb 2018 04:48:50 +0000","['Hi,\n\nJust applied all recent updates and s..."
4,<CABbDtLqovkBOBw2YM+4kP1pTG7Na4DPZj6ZGojrj9snh...,"Thu, 01 Feb 2018 10:59:36 +0530","[""On Wed, Jan 31, 2018 at 11:56 PM, Josh Boyer..."


In [119]:
df = df[150:200]

We take only a small sample of the dataframe, for two reasons:
* The free IBM Watson account allows only a limited number of queries.
* Some emails contain code and other unclean text, which causes the `watson_analyze()` function below to throw an error.

#TODO: Import data cleaning scripts from auto-faq project to overcome the above error
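
Until the cleaning scripts are available, one possible workaround for the second issue is to wrap each call so that a failing email yields an error record instead of aborting the whole run. The following is only a sketch under that assumption: `safe_analyze` and `fake_analyze` are hypothetical names that do not appear elsewhere in this notebook, and the stand-in function simulates a failure without calling the Watson service.

```python
def safe_analyze(analyze_fn, message):
    """Call analyze_fn(message); return (result, None) or (None, error text)."""
    try:
        return analyze_fn(message), None
    except Exception as exc:  # broad on purpose: the exact error type depends on the SDK
        return None, f"{type(exc).__name__}: {exc}"


def fake_analyze(message):
    # Hypothetical stand-in for the real NLU call; rejects text with no content.
    if not message.strip():
        raise ValueError("no usable text")
    return {"keywords": [{"text": word} for word in message.split()]}


results = [safe_analyze(fake_analyze, msg) for msg in ["kernel update", "   "]]
failed = [err for _, err in results if err is not None]
print(f"{len(failed)} of {len(results)} messages failed")
```

Keeping the error text next to each result makes it possible to inspect afterwards which emails need extra cleaning.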

In [120]:
# Cleaning function to be separated from this nb as a part of https://github.com/aicoe-aiops/mailing-list-analysis-toolkit/issues/41
def strip_thread(text):
    text = text.replace("\r", "")
    lines = text.split("\n")
    lines = [line for line in lines if len(line) > 0]
    lines = [line for line in lines if line[0] != ">"]
    lines = [line for line in lines if line[:3] != "Re:"]
    lines = [line for line in lines if line[:7] != "Subject"]
    lines = [line for line in lines if line[:5] != "From:"]
    lines = [line for line in lines if line[:5] != "Date:"]
    lines = [line for line in lines if "BEGIN PGP SIGNED MESSAGE" not in line]
    lines = [line for line in lines if line[:5] != "Hash:"]
    lines = [line for line in lines if line[:10] != "Version: G"]
    lines = [line for line in lines if "wrote:" not in line]
    lines = [line for line in lines if "wrote :" not in line]
    lines = [line for line in lines if "writes:" not in line]
    lines = [line for line in lines if line[:7] != "Am Mit,"]
    lines = [line for line in lines if line[:7] != "Am Don,"]
    lines = [line for line in lines if line[:7] != "Am Mon,"]
    lines = [line for line in lines if line[:7] != "Quoting"]
    lines = [line for line in lines if line[:10] != "Em Quinta,"]
    lines = [line for line in lines if "said:" not in line]
    lines = [
        line
        for line in lines
        if re.match(
            ".*n (Sun|Mon|Tue|Wed|Thu|Fri|Sat), .. (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec) 20..*",
            line,
        )
        is None
    ]
    lines = [
        line
        for line in lines
        if re.match(
            (
                ".*n (Sunday|Monday|Tuesday|Wednesday|Thursday|Friday|Saturday) .."
                " (January|February|March|April|May|June|July|August|September|October|November|December) 20..*"
            ),
            line,
        )
        is None
    ]
    lines = [
        line
        for line in lines
        if re.match(
            ".*n (Sun|Mon|Tue|Wed|Thu|Fri|Sat), (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec) .., 20..*",
            line,
        )
        is None
    ]
    lines = [
        line
        for line in lines
        if re.match(
            r".*n (Sun|Mon|Tue|Wed|Thu|Fri|Sat), 20[\d]{2}-[\d]{2}-[\d]{2} at.*",
            line,
        )
        is None
    ]
    lines = [line for line in lines if line[-6:] != "said: "]
    lines = [line for line in lines if line[-8:] != "babbled:"]
    lines = [line for line in lines if line[-7:] != "wrot=e:"]
    lines = [line for line in lines if line[-8:] != "A9crit :"]
    lines = [line for line in lines if line[0] != "|"]
    return "\n".join(lines)


# format for CSV, clean special characters, and remove extraneous emails
def pandas_clean(emails):
    emails["Body"].replace(
        to_replace=[
            r"\n",
            "\n",
        ],
        value=" ",
        regex=True,
        inplace=True,
    )
    emails["Body"].replace(
        to_replace=[r"\'", "'", ">", "<", "= ", "-", r"http\S+"],
        value="",
        regex=True,
        inplace=True,
    )
    emails["Body"].replace(
        to_replace=[r"\\\s+", r"\\s+", "="], value="", regex=True, inplace=True
    )
    emails["Body"].replace(
        to_replace=["   ", "  "], value=" ", regex=True, inplace=True
    )
    emails["Body"].replace(
        to_replace=["_", "3D"], value="", regex=True, inplace=True
    )
    emails["Body"].replace(
        to_replace=["   ", "  "], value=" ", regex=True, inplace=True
    )
    emails["Body"].replace(
        to_replace=["   ", "  "], value=" ", regex=True, inplace=True
    )
    emails["Body"] = emails["Body"].apply(
        lambda x: x.strip().replace(r"\n", "")
    )

    emails.drop(emails.index[emails["Body"] == ""], inplace=True)
    emails.drop(emails.index[emails["Body"] == " "], inplace=True)
    emails.dropna(subset=["Body"], inplace=True)

    emails = emails.reset_index()
    emails.drop("index", axis=1, inplace=True)
    return emails

In [121]:
clean = df.copy()
clean["Body"] = df["Body"].apply(strip_thread)
clean = pandas_clean(clean)
clean

Unnamed: 0,Message-ID,Date,Body
0,<23160.21991.603350.704278@localhost.localdomain>,"Mon, 05 Feb 2018 14:02:31 +0100","[""Hallo,I am posting the following to the deve..."
1,<1497001315.12366749.1517840821880.JavaMail.zi...,"Mon, 05 Feb 2018 15:27:01 +0100","[ Mail original De: ""Jakub Cajka"" Our as Fedo..."
2,<1007665295.12665074.1517845711607.JavaMail.zi...,"Mon, 05 Feb 2018 16:48:31 +0100","[ Mail original De: ""Jakub Cajka"" Hi Jakub I t..."
3,<20180205220003.8C9E160A4182@fedocal02.phx2.fe...,"Mon, 05 Feb 2018 22:00:03 +0000","[Dear all,You are kindly invited to the meetin..."
4,<CAObL_7E0HOrrim=utF+G+oZEpi1qo2yzLbo2oGZUvoXe...,"Mon, 05 Feb 2018 22:02:31 +0000","[On Thu, Feb 1, 2018 at 11:34 AM, Hans de Goed..."
5,<CACmE3aMjMkFLWQ4x-sh+NNE9aS6my1NVPw-M8K-8hujs...,"Mon, 05 Feb 2018 22:19:12 +0000",[Hi Matthew!Nice to have you on board :)Cheers...
6,<CACmE3aME8cdidFOZJEPa8xgxycA6ETC5UODVUgmvF4E7...,"Mon, 05 Feb 2018 22:23:35 +0000","[Thank you all for the warm welcome! :), PGRpd..."
7,<20180206000654.24998.31154@mailman01.phx2.fed...,"Tue, 06 Feb 2018 00:06:54 +0000","[Hello, Great, If there are virtual posibility..."
8,<20180206060714.8900.73898@mailman01.phx2.fedo...,"Tue, 06 Feb 2018 06:07:14 +0000","[ Submit a PR upstream to fix it, and use that..."
9,<20180206073834.GA17034@redhat.com>,"Tue, 06 Feb 2018 07:38:34 +0000",[binutils 2.30 has been out for about a week. ...


In [122]:
clean["Date"] = clean["Date"].apply(lambda x: pd.to_datetime(x))
clean["Chunk"] = clean["Date"].apply(
    lambda x: datetime.date(x.year, x.month, 1)
)
clean = clean.sort_values(by="Date")
clean.reset_index(inplace=True, drop=True)
clean.head()

Unnamed: 0,Message-ID,Date,Body,Chunk
0,<23160.21991.603350.704278@localhost.localdomain>,2018-02-05 14:02:31+01:00,"[""Hallo,I am posting the following to the deve...",2018-02-01
1,<1497001315.12366749.1517840821880.JavaMail.zi...,2018-02-05 15:27:01+01:00,"[ Mail original De: ""Jakub Cajka"" Our as Fedo...",2018-02-01
2,<1007665295.12665074.1517845711607.JavaMail.zi...,2018-02-05 16:48:31+01:00,"[ Mail original De: ""Jakub Cajka"" Hi Jakub I t...",2018-02-01
3,<20180205220003.8C9E160A4182@fedocal02.phx2.fe...,2018-02-05 22:00:03+00:00,"[Dear all,You are kindly invited to the meetin...",2018-02-01
4,<CAObL_7E0HOrrim=utF+G+oZEpi1qo2yzLbo2oGZUvoXe...,2018-02-05 22:02:31+00:00,"[On Thu, Feb 1, 2018 at 11:34 AM, Hans de Goed...",2018-02-01
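
The `Chunk` column assigns every email to the first day of its month, which later serves as the grouping key. As a small standalone illustration of that mapping, the sketch below uses the standard library's `email.utils.parsedate_to_datetime` rather than `pd.to_datetime`; `month_chunk` is a hypothetical helper name, not part of this notebook.

```python
import datetime
from email.utils import parsedate_to_datetime


def month_chunk(date_header):
    """Map an RFC 2822 date string to the first day of its month."""
    parsed = parsedate_to_datetime(date_header)
    # Year and month are taken in the header's own timezone, not converted to UTC.
    return datetime.date(parsed.year, parsed.month, 1)


print(month_chunk("Mon, 05 Feb 2018 14:02:31 +0100"))
```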


## Step 3: Analyze sample emails

For our analysis, we'll focus on extracting:
- Keywords 
- Actions and Objects (from semantic roles)
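
Both items are read from nested lists of dictionaries in the analysis result. The sketch below shows that extraction on a hand-written dictionary, so it runs without credentials. The sample data is hypothetical and only mirrors the fields this notebook reads (`keywords[].text` and `semantic_roles[].action.normalized`); a real response contains more fields. `extract_keywords_and_actions` is likewise a name chosen for illustration.

```python
# Hypothetical response, shaped like the fields the notebook reads.
sample_result = {
    "keywords": [{"text": "print dialog"}, {"text": "desktop"}],
    "semantic_roles": [
        {"action": {"text": "fails", "normalized": "fail"}},
        {"subject": {"text": "the dialog"}},  # no "action" key: skipped
    ],
}


def extract_keywords_and_actions(result):
    """Return (keywords, actions) from a response dictionary."""
    keywords = [kw["text"] for kw in result.get("keywords", [])]
    actions = [
        role["action"]["normalized"]
        for role in result.get("semantic_roles", [])
        if "action" in role
    ]
    return keywords, actions


keywords, actions = extract_keywords_and_actions(sample_result)
print(keywords, actions)
```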

In [123]:
def watson_analyze(message):
    """
    Extract keywords and semantic roles from text
    Input : message
    Output : pd.Series of [list of normalized action verbs, list of keywords]
    """
    result = nlu.analyze(
        text=message,
        features=Features(
            keywords=KeywordsOptions(), semantic_roles=SemanticRolesOptions()
        ),
    ).get_result()
    actions_arr = []
    keywords_arr = []
    for keyword in result["keywords"]:
        keywords_arr.append(keyword["text"])
    # Semantic roles are optional in the response; keep only entries with an action
    if "semantic_roles" in result:
        for semantic_result in result["semantic_roles"]:
            if "action" in semantic_result:
                actions_arr.append(semantic_result["action"]["normalized"])

    return pd.Series([actions_arr, keywords_arr])

In [124]:
clean["Actions"] = np.nan
clean["Keywords"] = np.nan

In [125]:
clean[['Actions', 'Keywords']] = clean['Body'].apply(lambda x: watson_analyze(x))

In [126]:
clean.head()

Unnamed: 0,Message-ID,Date,Body,Chunk,Actions,Keywords
0,<23160.21991.603350.704278@localhost.localdomain>,2018-02-05 14:02:31+01:00,"[""Hallo,I am posting the following to the deve...",2018-02-01,"[be, be post, have, be, fail, see, do, savecha...","[KDE print dialog, Fedora cinnamon spin, cinna..."
1,<1497001315.12366749.1517840821880.JavaMail.zi...,2018-02-05 15:27:01+01:00,"[ Mail original De: ""Jakub Cajka"" Our as Fedo...",2018-02-01,"[believe, be prepare, really share, be merge, ...","[yum version, dumb decision, dedicated package..."
2,<1007665295.12665074.1517845711607.JavaMail.zi...,2018-02-05 16:48:31+01:00,"[ Mail original De: ""Jakub Cajka"" Hi Jakub I t...",2018-02-01,"[think, don\t, do, want, replace, be, create, ...","[Fedora Go packages, lot of work, t cover, opt..."
3,<20180205220003.8C9E160A4182@fedocal02.phx2.fe...,2018-02-05 22:00:03+00:00,"[Dear all,You are kindly invited to the meetin...",2018-02-01,[invite],"[Modularity WG, Meeting of the Modularity Work..."
4,<CAObL_7E0HOrrim=utF+G+oZEpi1qo2yzLbo2oGZUvoXe...,2018-02-05 22:02:31+00:00,"[On Thu, Feb 1, 2018 at 11:34 AM, Hans de Goed...",2018-02-01,"[write, work, like, off be, be, be know to cau...","[driver PSR support, Improved Laptop Battery L..."


## Step 4: Aggregate Keywords by Month


Here we aggregate, by month, the keywords that Watson NLU extracted from each email, and build a long dataframe with the columns `month`, `word`, and `count`, which can be used for making plots.

Note: To observe monthly trends, we need to run the analysis on a larger sample of the data, which would give us more months and keywords to analyze.

That requires two things: a paid account, so that we can run a larger number of queries, and a cleaned dataset with the code fragments removed, so that the analysis runs without errors.
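
To make the target shape concrete, here is a minimal sketch of the same aggregation on toy data, using only the standard library. The `emails` list is made up for illustration and stands in for the `Chunk` and `Keywords` columns. Each `(month, word)` pair becomes one row of the long table.

```python
from collections import Counter

# Hypothetical per-email records: (month, keywords extracted from one email).
emails = [
    ("2018-02-01", ["binutils", "Rawhide"]),
    ("2018-02-01", ["binutils"]),
    ("2018-03-01", ["Rawhide"]),
]

counts = Counter()
for month, keywords in emails:
    for word in keywords:
        counts[(month, word)] += 1

# One row per (month, word) pair, most frequent first.
long_rows = [
    {"month": month, "word": word, "count": count}
    for (month, word), count in counts.most_common()
]
for row in long_rows:
    print(row)
```

A list of dictionaries in this form can be passed directly to `pd.DataFrame` to obtain the long dataframe.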

In [190]:
monthly_dict_list = []

# Group emails by month and count how often each keyword occurs in that month
for month, group in clean.groupby("Chunk"):
    month_keywords = defaultdict(int)

    for keywords in group["Keywords"]:
        for word in keywords:
            # Key each count by (month, word) so it unpacks into the long dataframe
            month_keywords[(str(month), str(word))] += 1

    # Sort by count, most frequent first; reverse=True belongs to sorted(), not dict()
    month_keywords = dict(
        sorted(month_keywords.items(), key=lambda item: item[1], reverse=True)
    )
    monthly_dict_list.append(month_keywords)


In [195]:
monthly_words_df = pd.DataFrame(
    [
        {"month": key[0], "word": key[1], "count": value}
        for month_dict in monthly_dict_list
        for key, value in month_dict.items()
    ]
)

In [198]:
monthly_words_df.head(10)

Unnamed: 0,month,word,count
0,2018-02-01,KDE print dialog,1
1,2018-02-01,Fedora cinnamon spin,1
2,2018-02-01,cinnamon desktop,1
3,2018-02-01,KDE problem,1
4,2018-02-01,CUPS settings,1
5,2018-02-01,KDE integrationin Fedora,1
6,2018-02-01,devel list,1
7,2018-02-01,mostrelevant application,1
8,2018-02-01,cinnamon spin,1
9,2018-02-01,settings,1


In [199]:
monthly_words_df.tail(10)

Unnamed: 0,month,word,count
228,2018-02-01,programming,1
229,2018-02-01,week,1
230,2018-02-01,Windows,1
231,2018-02-01,Supports Linux,1
232,2018-02-01,modest enhancements,1
233,2018-02-01,binutils,1
234,2018-02-01,anysoftware,1
235,2018-02-01,inRawhide,1
236,2018-02-01,RISCVarchitecture.Rich,1


## Step 5: Save results



In [200]:
new_files = []

monthly_words_df.to_csv(
    f"{BASE_PATH}/processed/keywords/watson-nlu-keywords.csv", header=False
)
new_files.append(f"{BASE_PATH}/processed/keywords/watson-nlu-keywords.csv")

In [202]:
if os.getenv("RUN_IN_AUTOMATION"):
    utils.upload_files(
        (f, f"processed/keywords/{Path(f).stem}.csv") for f in new_files
    )

Copyright © 2019 IBM. This notebook and its source code are released under the terms of the MIT License.