**README**
- This notebook contains V1 of the discovery pipeline, focused on Twitter.
- It evaluates Twitter search results for relevance, scrapes text from the URLs in the results predicted as relevant, and then evaluates the scraped text for relevance before returning a final results list for a user to check.
- Predictions are run against the relevant training set at each stage (tweets for the Twitter results, descriptions for the scraped URLs).
- Once this flow has been tested, the aim is to rebuild it with all key functions in external Python files, so that the user only has to supply the necessary inputs at each step to get a result.

TO DO:
- Export the Twitter training set to pkl so it loads faster
- Add sites to a blacklist before scraping
- Add values to remove after scraping, e.g. 404
- Integrate Twitter search
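
One way to approach the blacklist item is to drop links whose domain is blacklisted before any request is made. This is a minimal sketch and not part of the pipeline yet: `filter_blacklisted` is a hypothetical helper, and the blacklist entries and URLs are sample values chosen for illustration.

```python
from urllib.parse import urlparse

def filter_blacklisted(urls, blacklist):
    """Return the URLs whose host is not a blacklisted domain or one of its subdomains."""
    kept = []
    for url in urls:
        # hostname is lowercased by urlparse; it is None when the URL has no scheme
        host = (urlparse(url).hostname or '').lower()
        blocked = any(host == d or host.endswith('.' + d) for d in blacklist)
        if not blocked:
            kept.append(url)
    return kept

# sample blacklist entries (assumed, for illustration only)
blacklist = ['youtube.com', 'open.spotify.com', 'bandcamp.com']
urls = [
    'https://www.youtube.com/watch?v=abc',
    'https://artist.bandcamp.com/album/x',
    'http://Memphis.edu/libraries',
]
kept = filter_blacklisted(urls, blacklist)
print(kept)
```

Matching on the parsed host rather than on the whole URL avoids discarding a page just because a blacklisted name appears somewhere in its path.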

Step 1:
- Load libraries and functions 

In [1]:
#imports + path
from __future__ import print_function
import requests
import pandas as pd
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from sklearn.metrics import classification_report 
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from bs4 import BeautifulSoup
pd.set_option('display.max_rows', 100)
path = '/Users/laurentfintoni/Desktop/University/COURSE DOCS/THESIS/Internship/musow-pipeline/'

In [2]:
#PREDICTION FUNCTIONS W/ AND W/O LOGREGCV

def lr_model_predict_cv(t_input, t_feature, target, cv_int, score_type, p_input, p_feature, filename, path):
    # vectorise the training text: raw counts, then TF-IDF weights
    count_vect = CountVectorizer()
    tfidf_transformer = TfidfTransformer()
    x_count = count_vect.fit_transform(t_input[t_feature])
    x_train = tfidf_transformer.fit_transform(x_count)
    y_train = t_input[target].values
    # fit logistic regression, choosing C by cross-validation
    model = LogisticRegressionCV(solver='liblinear', random_state=44, cv=cv_int, scoring=score_type)
    model.fit(x_train, y_train)
    # save the fitted model (the vectoriser and transformer are not saved with it)
    export = f'LOGREG_RELEVANCE/{filename}.sav'
    with open(path+export, 'wb') as f:
        pickle.dump(model, f)
    # transform the prediction text with the vocabulary fitted on the training set
    x_new_count = count_vect.transform(p_input[p_feature])
    x_new_train = tfidf_transformer.transform(x_new_count)
    y_predict = model.predict(x_new_train)
    scores = model.decision_function(x_new_train)
    # log-probabilities per row, one value per class in model.classes_ order
    probability = model.predict_log_proba(x_new_train)
    result = p_input.copy()
    result['Prediction'] = list(y_predict)
    result['Score'] = list(scores)
    result['Probability'] = list(probability)
    result['Input Length'] = result[p_feature].str.len()
    return result

def lr_model_predict(t_input, t_feature, target, p_input, p_feature, filename, path):
    # vectorise the training text: raw counts, then TF-IDF weights
    count_vect = CountVectorizer()
    tfidf_transformer = TfidfTransformer()
    x_count = count_vect.fit_transform(t_input[t_feature])
    x_train = tfidf_transformer.fit_transform(x_count)
    y_train = t_input[target].values
    # fit logistic regression with a fixed regularisation strength
    model = LogisticRegression(solver='liblinear', C=10.0, random_state=44)
    model.fit(x_train, y_train)
    # save the fitted model (the vectoriser and transformer are not saved with it)
    export = f'LOGREG_RELEVANCE/{filename}.sav'
    with open(path+export, 'wb') as f:
        pickle.dump(model, f)
    # transform the prediction text with the vocabulary fitted on the training set
    x_new_count = count_vect.transform(p_input[p_feature])
    x_new_train = tfidf_transformer.transform(x_new_count)
    y_predict = model.predict(x_new_train)
    scores = model.decision_function(x_new_train)
    # log-probabilities per row, one value per class in model.classes_ order
    probability = model.predict_log_proba(x_new_train)
    result = p_input.copy()
    result['Prediction'] = list(y_predict)
    result['Score'] = list(scores)
    result['Probability'] = list(probability)
    result['Input Length'] = result[p_feature].str.len()
    return result

In [3]:
#scrape URLs for title, description and URL text

def scrape_links(link_list):
    # send a browser-like user agent with every request
    headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
    rows = []
    for link in link_list:
        URL = link
        try:
            page = requests.get(URL, headers=headers, timeout=5)
            if page.status_code == 200:
                soup = BeautifulSoup(page.content, "html.parser")
                if soup and soup.find('head') and soup.find('body'):
                    # title from <head>, description from every <p> in <body>
                    title = ' '.join([t.text for t in soup.find('head').find_all('title')]).strip()
                    text = ' '.join([p.text for p in soup.find('body').find_all('p')]).strip()
                    rows.append({'Title': title, 'Description': text, 'URL': URL.strip()})
        except Exception:
            # skip any link that fails to download or parse
            continue
    # build the frame once at the end; DataFrame.append was removed in pandas 2.0
    return pd.DataFrame(rows, columns=['Title', 'Description', 'URL'])

Step 2:
- load training and prediction sets 
- twitter training sets are taken from bigram searches run for the whole of 2021. They are filtered by language to remove non-English results for now, and sampled to even out the two halves of the set. The negative set uses the 'digital humanities' and 'problem solving' bigrams; the positive set uses bigrams from keyword research.
- prediction sets are taken from bigram searches during the first 3 months of 2022.
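
The idea of evening out the two halves of the training set can be illustrated on plain lists. This is a sketch only: `balance_classes` is a hypothetical helper and the toy tweets are invented. The notebook itself samples the negative DataFrame to a fixed size with `sample(n=4379, random_state=56)`.

```python
import random

def balance_classes(positive, negative, seed=56):
    """Downsample the larger class so both classes end up the same size."""
    rng = random.Random(seed)
    n = min(len(positive), len(negative))
    pos = rng.sample(positive, n)
    neg = rng.sample(negative, n)
    # label each text: 1 for positive, 0 for negative
    return [(t, 1) for t in pos] + [(t, 0) for t in neg]

# invented toy data: 3 positive texts, 10 negative texts
positive = ['music archive launch', 'new song dataset', 'midi file collection']
negative = ['dh talk %d' % i for i in range(10)]
balanced = balance_classes(positive, negative)
print(len(balanced), sum(label for _, label in balanced))
```

Taking the size from the smaller class, rather than hard-coding it, would keep the two halves even if the positive set changes.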

In [4]:
#description training set 
training_set_even_adds = pd.read_pickle(path+'LOGREG_RELEVANCE/trainingset_even_extended.pkl')
new_training_set = pd.read_pickle(path+'LOGREG_RELEVANCE/new_training_set.pkl')

In [5]:
#negative twitter training set
dh = pd.read_pickle(path+'TWITTER_SEARCHES/NEGATIVE/digital_humanities_2021.pkl')
problem_solving = pd.read_pickle(path+'TWITTER_SEARCHES/NEGATIVE/problem_solving_21.pkl')
twitter_neg = pd.concat([dh, problem_solving])
twitter_neg = twitter_neg.loc[twitter_neg['lang'] == 'en']
twitter_neg['Target'] = '0'
twitter_neg = twitter_neg.sample(n=4379, random_state=56)
twitter_neg = twitter_neg[['tweet', 'Target']].reset_index(drop=True)

In [6]:
#positive twitter training set 
music_collection = pd.read_pickle(path+'TWITTER_SEARCHES/MUSOW BIGRAMS/twitter_music_collection.pkl')
song_dataset = pd.read_pickle(path+'TWITTER_SEARCHES/MUSOW BIGRAMS/twitter_song_dataset.pkl')
sound_archive = pd.read_pickle(path+'TWITTER_SEARCHES/MJI BIGRAMS/twitter_sound_archive.pkl')
digital_archive = pd.read_pickle(path+'TWITTER_SEARCHES/MUSOW BIGRAMS/twitter_digital_archive.pkl')
music_archive = pd.read_pickle(path+'TWITTER_SEARCHES/MUSOW BIGRAMS/twitter_music_archive.pkl')
digi_music_archive = pd.read_pickle(path+'TWITTER_SEARCHES/MUSOW BIGRAMS/twitter_digital_music_archive.pkl')
midi_file = pd.read_pickle(path+'TWITTER_SEARCHES/MUSOW BIGRAMS/twitter_midi_file.pkl')
music_data = pd.read_pickle(path+'TWITTER_SEARCHES/MUSOW BIGRAMS/twitter_music_data.pkl')
music_research = pd.read_pickle(path+'TWITTER_SEARCHES/MJI BIGRAMS/twitter_music_research.pkl')
music_dataset = pd.read_pickle(path+'TWITTER_SEARCHES/MUSOW BIGRAMS/twitter_music_dataset.pkl')
twitter_pos = pd.concat([sound_archive, music_collection, digital_archive, music_archive, song_dataset, digi_music_archive, midi_file, music_data, music_research, music_dataset])
twitter_pos = twitter_pos.loc[twitter_pos['lang'] == 'en']
twitter_pos['Target'] = '1'
twitter_pos = twitter_pos[['tweet', 'Target']].reset_index(drop=True)

In [7]:
#final twitter training set
twitter_set = pd.concat([twitter_pos, twitter_neg])
twitter_set['Target'] = twitter_set['Target'].astype('int')
twitter_set = twitter_set.reset_index(drop=True)

In [49]:
#load a prediction set 
prediction_twitter = pd.read_pickle(path+'TWITTER_SEARCHES/PREDICTIONS/music_library_22.pkl')
prediction_twitter = prediction_twitter.loc[prediction_twitter['lang'] == 'en']

In [50]:
len(prediction_twitter)

467

Step 3:
- run the predict function for twitter (logregcv and logreg options)
- filter the results by positives and optionally by inclusion of the 'music' kw
- return a df w/ tweet, prediction value, confidence score, probability, length of input and url 
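
The 'Probability' column is filled from `predict_log_proba`, so each row holds two natural-log probabilities (negative class first, then positive class, for the 0/1 targets used here) rather than probabilities. As a hedged worked example using the first row of the LogRegCV results table, exponentiating the second value gives the positive-class probability, and for a binary logistic regression this should agree with the sigmoid of the 'Score' value. `sigmoid` is a helper defined here for illustration.

```python
import math

def sigmoid(x):
    """Logistic function: maps a log-odds score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# values copied from the first row of the LogRegCV results table
log_proba = [-0.6960218653889189, -0.6902807358617171]
score = 0.005741  # rounded as displayed in the table

# convert log-probabilities back to probabilities
proba = [math.exp(lp) for lp in log_proba]
print(proba)           # class 0 and class 1 probabilities
print(sigmoid(score))  # should be close to proba[1]
```

Both classes sit very close to 0.5 for this row, which matches the small 'Score' values in the LogRegCV table.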

In [80]:
#variable for removing unwanted results 
discard = ['youtu', '404', 'Not Found', 'bandcamp', 'ebay', 'It needs a human touch', 'Page not found', 'open.spotify.com', 'We\'re sorry...', 'Not Acceptable!', 'Access denied', '412 Error', 'goo.gl', 'instagr.am', 'soundcloud', 'apple.co', 'amzn', 'masterstillmusic', 'Facebook', 'facebook', 'sheetmusiclibrary.website', 'Unsupported browser', 'Last.fm', 'last.fm', 'amazon.com', 'tidal.com', 'tmblr.co']
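
Because the `discard` terms are joined with `|` and passed to `str.contains`, they are treated as one regular expression, so a `.` in a term such as 'apple.co' matches any character. A possible refinement, sketched here with the standard `re` module, is to escape each term first. `build_discard_pattern` is a hypothetical helper and the example URL is made up.

```python
import re

def build_discard_pattern(terms):
    """Join terms into one regex in which every term is matched literally."""
    return re.compile('|'.join(re.escape(t) for t in terms))

terms = ['apple.co', 'last.fm', 'goo.gl']
raw = re.compile('|'.join(terms))       # '.' acts as a wildcard
escaped = build_discard_pattern(terms)  # '.' only matches a literal dot

# invented URL that contains 'apple-co' but not 'apple.co'
url = 'https://pineapple-cooking.example/recipes'
print(bool(raw.search(url)), bool(escaped.search(url)))
```

The escaped pattern's `.pattern` string could be passed to `str.contains` in place of `'|'.join(discard)`.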

In [52]:
#run w/ LogRegCV
tweet_predict_cv = lr_model_predict_cv(twitter_set, 'tweet', 'Target', 2, 'precision', prediction_twitter, 'tweet', 'twitter_test_cv', path)

In [53]:
#filter and display results 
tweet_predict_cv_df = tweet_predict_cv.copy()
tweet_predict_cv_df = tweet_predict_cv_df.loc[tweet_predict_cv_df['Prediction'] == 1]
tweet_predict_cv_df = tweet_predict_cv_df[~tweet_predict_cv_df.url.str.contains('|'.join(discard))]
tweet_predict_cv_df = tweet_predict_cv_df.drop_duplicates(subset=['tweet'])
tweet_predict_cv_df = tweet_predict_cv_df.sort_values(by='Score', ascending=False).reset_index(drop=True)
tweet_predict_cv_df = tweet_predict_cv_df[['tweet', 'Prediction', 'Score', 'Probability', 'Input Length', 'url']]
tweet_predict_cv_df

Unnamed: 0,tweet,Prediction,Score,Probability,Input Length,url
0,UofM LIBRARY HIGHLIGHT The Music Library is pr...,1,0.005741,"[-0.6960218653889189, -0.6902807358617171]",307,http://Memphis.edu/libraries
1,New Music up now on all music platforms. Plea...,1,0.005427,"[-0.6958644752026814, -0.6904372496026004]",276,https://bit.ly/JBTWCamP1
2,Download Free Music to Apple Music Library (WI...,1,0.005188,"[-0.6957447538267034, -0.6905563372024054]",100,https://popcornspider.com/download-free-music-...
3,Royalty-Free Music Library | Background Music ...,1,0.005150,"[-0.6957253976812473, -0.6905755935519594]",79,https://www.awin1.com/cread.php?awinmid=22979&...
4,"Copland: 6 piano works (with sheet music) , Sh...",1,0.005084,"[-0.695692297383773, -0.6906085249147889]",93,https://sheetmusiclibrary.website/2022/01/30/c...
...,...,...,...,...,...,...
339,'Hate to say I told you so' As #Spotify is tak...,1,0.000063,"[-0.6931786054194807, -0.6931157566879006]",199,https://powrusr.com/i-love-spotify-but-i-still...
340,This place was taken at the ‘Hyundai Card Musi...,1,0.000048,"[-0.6931711714835507, -0.6931231902118905]",290,http://naver.me/x35imW5q
341,Now Let Us Rejoice : Hymns #ldsmusic https://...,1,0.000039,"[-0.693166773651256, -0.6931275878525163]",61,https://www.churchofjesuschrist.org/music/libr...
342,"Viola and piano recital 🎻🎹 📅 Monday 11 April,...",1,0.000006,"[-0.693150183667844, -0.6931441774610653]",276,https://www.eventbrite.co.uk/e/rescheduled-lun...


In [54]:
#optional filter by kw
tweet_predict_cv_df_kw = tweet_predict_cv_df[tweet_predict_cv_df['tweet'].str.contains('music')]
tweet_predict_cv_df_kw = tweet_predict_cv_df_kw.drop_duplicates(subset=['tweet'])
tweet_predict_cv_df_kw = tweet_predict_cv_df_kw.reset_index(drop=True)
tweet_predict_cv_df_kw 

Unnamed: 0,tweet,Prediction,Score,Probability,Input Length,url
0,UofM LIBRARY HIGHLIGHT The Music Library is pr...,1,0.005741,"[-0.6960218653889189, -0.6902807358617171]",307,http://Memphis.edu/libraries
1,New Music up now on all music platforms. Plea...,1,0.005427,"[-0.6958644752026814, -0.6904372496026004]",276,https://bit.ly/JBTWCamP1
2,"Copland: 6 piano works (with sheet music) , Sh...",1,0.005084,"[-0.695692297383773, -0.6906085249147889]",93,https://sheetmusiclibrary.website/2022/01/30/c...
3,@TheGeekyCrochet @Godzy69 @YouTube use only th...,1,0.004648,"[-0.6954736706071459, -0.6908260905080791]",124,http://Bensound.com
4,Discover the Best Classical Music (with sheet ...,1,0.004565,"[-0.6954320985154429, -0.6908674715547481]",224,https://sheetmusiclibrary.website/2022/02/03/b...
...,...,...,...,...,...,...
227,Clementine 1.3.1 / 1.4.0 RC1 (796) (GPLv3) - A...,1,0.000147,"[-0.6932208780835144, -0.6930734884673009]",207,https://bit.ly/3lOW98i
228,AUDIOPHILE MAN - HIFI NEWS: N10/2 DIGITAL MUSI...,1,0.000122,"[-0.6932081851184521, -0.6930861797227676]",302,https://theaudiophileman.com/n10-2-digital-mus...
229,Work at Glyndebourne! We're looking for a Musi...,1,0.000096,"[-0.693195361566167, -0.6930990018750212]",250,https://www.glyndebourne.com/vacancy/music-lib...
230,Now Let Us Rejoice : Hymns #ldsmusic https://...,1,0.000039,"[-0.693166773651256, -0.6931275878525163]",61,https://www.churchofjesuschrist.org/music/libr...


In [55]:
#run w/ LogReg
tweet_predict = lr_model_predict(twitter_set, 'tweet', 'Target', prediction_twitter, 'tweet', 'twitter_test', path)

In [56]:
#filter and display results 
tweet_predict_df = tweet_predict.copy()
tweet_predict_df = tweet_predict_df.loc[tweet_predict_df['Prediction'] == 1]
tweet_predict_df = tweet_predict_df[~tweet_predict_df.url.str.contains('|'.join(discard))]
tweet_predict_df = tweet_predict_df.drop_duplicates(subset=['tweet'])
tweet_predict_df = tweet_predict_df.sort_values(by='Score', ascending=False).reset_index(drop=True)
tweet_predict_df = tweet_predict_df[['tweet', 'Prediction', 'Score', 'Probability', 'Input Length', 'url']]
tweet_predict_df

Unnamed: 0,tweet,Prediction,Score,Probability,Input Length,url
0,New Music up now on all music platforms. Plea...,1,12.414586,"[-12.414590545937914, -4.058940753804271e-06]",276,https://bit.ly/JBTWCamP1
1,UofM LIBRARY HIGHLIGHT The Music Library is pr...,1,12.392676,"[-12.392679715571768, -4.14885717508191e-06]",307,http://Memphis.edu/libraries
2,Royalty-Free Music Library | Background Music ...,1,10.747349,"[-10.74737048926978, -2.150210462177271e-05]",79,https://www.awin1.com/cread.php?awinmid=22979&...
3,Download Free Music to Apple Music Library (WI...,1,10.617486,"[-10.617510016577999, -2.448382715057378e-05]",100,https://popcornspider.com/download-free-music-...
4,Check out our positive pop music today for a f...,1,10.255816,"[-10.255851405365853, -3.515183172147048e-05]",214,https://sozo.lnk.to/Pop
...,...,...,...,...,...,...
360,Put Your Shoulder to the Wheel : Hymns #ldsmu...,1,0.510099,"[-0.9803750514946387, -0.47027624921251115]",73,https://www.churchofjesuschrist.org/music/libr...
361,"There may be a Meter of snow outside, however ...",1,0.433115,"[-0.9329723766830371, -0.4998571469483093]",275,http://www.lighthousedj.com
362,Melco N10/2 digital music library - @HiFiNewsm...,1,0.321097,"[-0.8665287450270682, -0.5454314991454846]",301,https://melco-audio.com/wp-content/uploads/202...
363,Sweet Is the Work : Hymns #ldsmusic https://t...,1,0.133170,"[-0.7619471187912205, -0.6287775065611821]",60,https://www.churchofjesuschrist.org/music/libr...


In [57]:
#optional filter by kw
tweet_predict_df_kw = tweet_predict_df[tweet_predict_df['tweet'].str.contains('music')]
tweet_predict_df_kw = tweet_predict_df_kw.drop_duplicates(subset=['tweet'])
tweet_predict_df_kw = tweet_predict_df_kw.reset_index(drop=True)
tweet_predict_df_kw 

Unnamed: 0,tweet,Prediction,Score,Probability,Input Length,url
0,New Music up now on all music platforms. Plea...,1,12.414586,"[-12.414590545937914, -4.058940753804271e-06]",276,https://bit.ly/JBTWCamP1
1,UofM LIBRARY HIGHLIGHT The Music Library is pr...,1,12.392676,"[-12.392679715571768, -4.14885717508191e-06]",307,http://Memphis.edu/libraries
2,Check out our positive pop music today for a f...,1,10.255816,"[-10.255851405365853, -3.515183172147048e-05]",214,https://sozo.lnk.to/Pop
3,Discover the Best Classical Music (with sheet ...,1,10.092213,"[-10.092254609934887, -4.139982475578892e-05]",224,https://sheetmusiclibrary.website/2022/02/03/b...
4,Streaming music is a great way to discover new...,1,9.793996,"[-9.794051762794236, -5.5783977525212764e-05]",253,https://www.hughesnet.com/media/5-best-music-s...
...,...,...,...,...,...,...
238,Put Your Shoulder to the Wheel : Hymns #ldsmu...,1,0.510099,"[-0.9803750514946387, -0.47027624921251115]",73,https://www.churchofjesuschrist.org/music/libr...
239,"There may be a Meter of snow outside, however ...",1,0.433115,"[-0.9329723766830371, -0.4998571469483093]",275,http://www.lighthousedj.com
240,Melco N10/2 digital music library - @HiFiNewsm...,1,0.321097,"[-0.8665287450270682, -0.5454314991454846]",301,https://melco-audio.com/wp-content/uploads/202...
241,Sweet Is the Work : Hymns #ldsmusic https://t...,1,0.133170,"[-0.7619471187912205, -0.6287775065611821]",60,https://www.churchofjesuschrist.org/music/libr...


Step 4:
- grab links from twitter predictions and scrape them for text 
- return a new table that can be used to predict relevance of URLs 
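
Different tweets may point at the same page, so one option before scraping is to drop repeated links while keeping their order, along with anything that is not a string. A minimal sketch, assuming a hypothetical helper `unique_links`; the sample list mixes URLs seen in the results with invented entries.

```python
def unique_links(links, exclude='twitter'):
    """Keep the first occurrence of each link, skipping non-strings and excluded links."""
    seen = set()
    kept = []
    for link in links:
        if not isinstance(link, str):
            continue  # e.g. missing values
        link = link.strip()
        if not link or exclude in link or link in seen:
            continue
        seen.add(link)
        kept.append(link)
    return kept

links = ['https://nkd.su', None, 'https://twitter.com/i/web/status/1',
         'https://nkd.su ', 'http://www.lighthousedj.com']
print(unique_links(links))
```

Scraping each page once would also shorten the scraping step, since every request can wait up to the 5 second timeout.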

In [58]:
#URLs to list 
twitter_link_list_cv = [link for link in tweet_predict_cv_df_kw['url'] if 'twitter' not in link]
twitter_link_list = [link for link in tweet_predict_df_kw['url'] if 'twitter' not in link]

In [None]:
#scrape URL list
links_to_add_cv = scrape_links(twitter_link_list_cv)
links_to_add = scrape_links(twitter_link_list)

In [60]:
#remove empty descriptions 
links_to_add_cv = links_to_add_cv[links_to_add_cv.Description != ''].reset_index(drop=True)
links_to_add = links_to_add[links_to_add.Description != ''].reset_index(drop=True)

Step 5:
- run the predict function on scraped URLs 
- return a DF w/ title, description, url, prediction, confidence score, probability and input length
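
The 'Input Length' value makes it possible to set aside very short scraped descriptions (for example 'Listen Now', 10 characters, in the results for this step). A sketch on plain dictionaries, where `min_length_filter` and the threshold of 50 characters are assumptions for illustration, not values tested in this notebook:

```python
def min_length_filter(rows, field='Description', min_chars=50):
    """Keep rows whose text field has at least min_chars characters."""
    return [row for row in rows if len(row.get(field, '')) >= min_chars]

rows = [
    {'Title': 'Positive Pop', 'Description': 'Listen Now'},
    {'Title': 'Musa - Joanna', 'Description': 'Choose music service'},
    {'Title': 'Example archive', 'Description': 'x' * 120},  # invented long description
]
kept = min_length_filter(rows)
print([row['Title'] for row in kept])
```

On a DataFrame the same idea would be a boolean filter on the 'Input Length' column.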

In [89]:
#run with LogRegCV 
twitter_preds_cv = lr_model_predict_cv(training_set_even_adds, 'Description', 'Target', 10, 'f1', links_to_add_cv, 'Description', 'extended_even_model_cv_twitter', path)

In [90]:
#filter results by positive value and score
twitter_preds_cv_df = twitter_preds_cv.copy()
twitter_preds_cv_df = twitter_preds_cv_df.loc[twitter_preds_cv_df['Prediction'] == 1]
twitter_preds_cv_df = twitter_preds_cv_df[~twitter_preds_cv_df.Title.str.contains('|'.join(discard))]
twitter_preds_cv_df = twitter_preds_cv_df[~twitter_preds_cv_df.URL.str.contains('|'.join(discard))]
twitter_preds_cv_df = twitter_preds_cv_df.sort_values(by='Score', ascending=False).reset_index(drop=True)
twitter_preds_cv_df

Unnamed: 0,Title,Description,URL,Prediction,Score,Probability,Input Length
0,UNAQREIA ORIGINAL SOUNDTRACK,総勢35曲もの楽曲群のほか、フルカラー8Pブックレット・ジャケット仕様にて綿密に練られた内部...,http://silkwork-games.com/unaqreia/,1,1.895333,"[-2.035328186246922, -0.13999521188463745]",170
1,Positive Pop,Listen Now,https://sozo.lnk.to/Pop,1,1.863783,"[-2.0079571779985943, -0.14417372987813046]",10
2,"Home - Black, Indigenous, people of colour (BI...",McGill Library • Questions? Ask us!Privacy notice,http://ow.ly/uCXg50DxMwB,1,1.448723,"[-1.659688324608549, -0.2109658155676027]",49
3,Professor - Ezangakini (feat. Shwi & Sun-El Mu...,Choose music service,https://Professor.lnk.to/Ezangakini,1,1.365187,"[-1.5925879957749758, -0.22740076972564838]",20
4,Musa - Joanna,Choose music service,https://Musa.lnk.to/JoannaQR,1,1.365187,"[-1.5925879957749758, -0.22740076972564838]",20
5,Nomfundo Moh - Amagama,Choose music service,https://nomfundomoh.lnk.to/Amagama,1,1.365187,"[-1.5925879957749758, -0.22740076972564838]",20
6,,"©Copyright KUNITACHI COLLEGE OF MUSIC, All Rig...",https://www.arc.ritsumei.ac.jp/lib/vm/kunitake/d/,1,1.171443,"[-1.4414083491522742, -0.2699656937558747]",59
7,Rap Fame | Don't try me,Playlists Download & Record Profile RapFame TV...,https://rapfame.app/i/OdTDY2Bd,1,1.164735,"[-1.4362917233398702, -0.271556764763459]",305
8,a neko desu request robot | nkd.su,next neko desu: april 23rd \nRedo\n \n\n\nRe:Z...,https://nkd.su,1,1.142402,"[-1.419315380126555, -0.27691300856925727]",2442
9,B77A Good Nintentions by IndieSFXMusic,A downloadable music Good Nintentions is a mus...,https://indiesfxmusic.itch.io/indiesfx-b77a,1,0.880585,"[-1.2273896943789924, -0.34680459832484933]",477


In [32]:
twitter_preds_cv_df.iloc[10]['URL']

'https://omny.fm/shows/ongoing-history-of-new-music/history-concert-sound'

In [None]:
#optional filtering step 
twitter_preds_cv_df.loc[twitter_preds_cv_df['Description'].str.contains('data')]

In [91]:
#run with LogReg 
twitter_preds = lr_model_predict(training_set_even_adds, 'Description', 'Target', links_to_add, 'Description', 'extended_even_model_twitter', path)

In [92]:
#filter results by positive value and score
twitter_preds_df = twitter_preds.copy()
twitter_preds_df = twitter_preds_df.loc[twitter_preds_df['Prediction'] == 1]
twitter_preds_df = twitter_preds_df[~twitter_preds_df.Title.str.contains('|'.join(discard))]
twitter_preds_df = twitter_preds_df[~twitter_preds_df.URL.str.contains('|'.join(discard))]
twitter_preds_df = twitter_preds_df.sort_values(by='Score', ascending=False).reset_index(drop=True)
twitter_preds_df

Unnamed: 0,Title,Description,URL,Prediction,Score,Probability,Input Length
0,Positive Pop,Listen Now,https://sozo.lnk.to/Pop,1,2.924242,"[-2.976555137015428, -0.05231287708629835]",10
1,UNAQREIA ORIGINAL SOUNDTRACK,総勢35曲もの楽曲群のほか、フルカラー8Pブックレット・ジャケット仕様にて綿密に練られた内部...,http://silkwork-games.com/unaqreia/,1,2.725974,"[-2.7894013916654123, -0.06342766335245945]",170
2,Derfel Day,"Se accetti i cookie, li useremo per migliorare...",https://www.paypal.com/pools/c/8IoYNIcCaC,1,2.45893,"[-2.5409951224473795, -0.08206503806679508]",220
3,"Home - Black, Indigenous, people of colour (BI...",McGill Library • Questions? Ask us!Privacy notice,http://ow.ly/uCXg50DxMwB,1,1.984848,"[-2.113594397383666, -0.1287462553856395]",49
4,Musa - Joanna,Choose music service,https://Musa.lnk.to/JoannaQR,1,1.79238,"[-1.9464417761417552, -0.15406210278402024]",20
5,Professor - Ezangakini (feat. Shwi & Sun-El Mu...,Choose music service,https://Professor.lnk.to/Ezangakini,1,1.79238,"[-1.9464417761417552, -0.15406210278402024]",20
6,Nomfundo Moh - Amagama,Choose music service,https://nomfundomoh.lnk.to/Amagama,1,1.79238,"[-1.9464417761417552, -0.15406210278402024]",20
7,,"©Copyright KUNITACHI COLLEGE OF MUSIC, All Rig...",https://www.arc.ritsumei.ac.jp/lib/vm/kunitake/d/,1,1.720428,"[-1.8850860580991375, -0.16465767836959178]",59
8,a neko desu request robot | nkd.su,next neko desu: april 23rd \nRedo\n \n\n\nRe:Z...,https://nkd.su,1,1.616145,"[-1.7973519426105826, -0.18120680543738593]",2442
9,Rap Fame | Don't try me,Playlists Download & Record Profile RapFame TV...,https://rapfame.app/i/OdTDY2Bd,1,1.5907,"[-1.7761688221961682, -0.18546906806213606]",305


**TESTING NOTES**
- Twitter search for 'digital archive': returns 200/800 results (logregcv v logreg) at the twitter filtering stage (against 963 results), which can be filtered to around 12-14 using the 'music' keyword. At the URL description filtering stage, the new training set returns 3 URLs for both logreg functions: two digital archives featuring text and still image content (no music data) and one BBC news item about a fan-led sound system archive in the UK (cv = 10, scoring = precision or f1). With the archive-only training set, both logreg functions return only one digital archive featuring text content.
- Twitter search for 'music research': returns 120-160 results (logregcv v logreg) at the twitter filtering stage (against 192 results), down to 70-90 when further filtered w/ the 'music' keyword. At the URL description filtering stage we get around 20 results for both logreg functions w/ the new training set, and around 10-16 w/ the archive-only training set (cv = 10, scoring = f1 for logregcv).
- Twitter search for 'digital music archive': returns 5 results for both logreg functions at the twitter filtering stage (against 6 results), which can be filtered to one w/ the 'music' keyword. At the URL description filtering stage both functions and both training sets return 2 results. All results at all stages are about the Manchester Music Archive, w/ links to collection pages on the site.
- Twitter search for 'music history': returns 580/674 results (logregcv v logreg) at the 1st filtering stage (against 823), which can be filtered to around 335/394 w/ the 'music' keyword. At the 2nd filtering stage both functions return 50-60 results w/ the new training set. Most of the results are irrelevant, though one of them is an existing MJI entry (it ranks in the second half of the results). With the archive-only training set, logregcv returns only 1 result and logreg returns around 30; in both cases the MJI entry isn't included.
- Twitter search for 'music library': returns 340/360 results (logregcv v logreg) at the twitter filtering stage (against 467), filtered down to 230/240 w/ the 'music' keyword. At the 2nd filtering stage, both functions return 40/50 results w/ the new training set. Nothing seems particularly relevant and there are non-English results (not sure how to filter them out?). With the archive-only training set, both functions return 24 (logregcv at cv = 10 and scoring = f1 only).
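
On the open question of non-English scraped results in the 'music library' test, one crude option is a script check on the scraped description. This is a hedged sketch, not a tested part of the pipeline: `ascii_letter_ratio`, `looks_latin_script` and the 0.8 threshold are assumptions. It only catches text in non-Latin scripts (such as the Japanese description in the Step 5 results) and will not flag other Latin-script languages such as Italian, for which a proper language detector would be needed.

```python
def ascii_letter_ratio(text):
    """Share of the alphabetic characters in text that are ASCII letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isascii() for ch in letters) / len(letters)

def looks_latin_script(text, threshold=0.8):
    """True when most letters are ASCII; a rough proxy, not language detection."""
    return ascii_letter_ratio(text) >= threshold

# description fragments taken from the Step 5 results
japanese = '総勢35曲もの楽曲群のほか、フルカラー8Pブックレット'
english = 'McGill Library Questions? Ask us! Privacy notice'
italian = 'Se accetti i cookie, li useremo per migliorare'

for sample in (japanese, english, italian):
    print(looks_latin_script(sample), sample[:20])
```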