# Concatenate Datasets 

For this tutorial you will need the `ads_interests.csv` file from last week's tutorial. You'll also need to convert the `your_reels_sentiments.html`, `your_reels_topics.html`, and `your_topics.html` files to CSV using the `html2csv` process described last week. These files can be found in the `your_topics` directory of your Instagram export.

### Clean the CSV

Once you have the CSV files, remove the first few lines of each one and give each file the header row `type,value`.
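If you would rather not edit the files by hand, the same cleanup can be sketched in pandas. This is only a sketch under assumptions: the sample text, the number of lines to skip (`skiprows=2`), and the variable names are made up, so check how many leading lines your own converted file actually contains.

```python
import io

import pandas as pd

# Hypothetical stand-in for a converted file: two leading lines we do not
# want, followed by rows of type/value pairs with no header row.
raw_csv = io.StringIO(
    "unwanted line 1\n"
    "unwanted line 2\n"
    "Name,alpha\n"
    "Name,beta\n"
)

# skiprows drops the leading lines; header=None plus names supplies the
# `type,value` header the tutorial expects.
cleaned = pd.read_csv(raw_csv, skiprows=2, header=None, names=["type", "value"])

print(cleaned)
```

For a real file, pass its path in place of `raw_csv`.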

In [1]:
import pandas as pd
ads_interests_file = './csv/fake_data/ads_interests.csv'
reels_sentiments_file = './csv/fake_data/your_reels_sentiments.csv'
reels_topics_file = './csv/fake_data/your_reels_topics.csv'
topics_file = './csv/fake_data/your_topics.csv'

### Add a column to distinguish the source

We have 4 data sets with similar columns that can be concatenated together. To record which file each row came from, we can use the `assign` method, which creates a new column or overwrites an existing one. I'll create a new `source` column, but you could instead overwrite the `type` column, for example with `.assign(type='ads_interests')`.

In [2]:
ads_interests = pd.read_csv(ads_interests_file).assign(source='ads_interests')
reels_sentiments = pd.read_csv(reels_sentiments_file).assign(source='reels_sentiments')
reels_topics = pd.read_csv(reels_topics_file).assign(source='reels_topics')
topics = pd.read_csv(topics_file).assign(source='topics')

reels_sentiments

Unnamed: 0,type,value,source
0,Name,cdtqcblc,reels_sentiments
1,Name,cmteqtncl,reels_sentiments
2,Name,cxcqeqnp,reels_sentiments
3,Name,acecqnceqnp,reels_sentiments
4,Name,acnnz,reels_sentiments
5,Name,qnerqqqnp,reels_sentiments
6,Name,qclcxqnp,reels_sentiments
7,Name,ecqrqqeqnp,reels_sentiments
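To make the difference between creating and overwriting a column concrete, here is a small sketch on a made-up data frame. The values are placeholders, not data from the export.

```python
import pandas as pd

# Made-up rows standing in for one of the imported files.
df = pd.DataFrame({"type": ["Name", "Name"], "value": ["alpha", "beta"]})

# A new keyword adds a column; every row gets the same scalar value.
with_source = df.assign(source="reels_sentiments")

# An existing column name overwrites that column instead.
overwritten = df.assign(type="reels_sentiments")

# assign returns a new data frame, so the original is left untouched.
print(with_source.columns.tolist())
print(overwritten["type"].tolist())
print(df.columns.tolist())
```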


### Concatenate

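With the CSV files imported, we can use the `pd.concat` function to concatenate all 4 data frames into 1. Calling `reset_index()` afterwards gives the rows a fresh 0-based index and keeps each row's original label as an `index` column.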

In [3]:
concatenated_interests = pd.concat([ads_interests, reels_sentiments, reels_topics, topics]).reset_index()

concatenated_interests

Unnamed: 0,index,type,value,source
0,0,Interest,qkaeaaeakq efqttaeakg,ads_interests
1,1,Interest,aeapdpgy gqqte,ads_interests
2,2,Interest,efqttaeakg,ads_interests
3,3,Interest,tfyeaeazpaea aaeapkqee,ads_interests
4,4,Interest,tfyeaeazpaea qdqgzaeaeq,ads_interests
...,...,...,...,...
1212,480,Name,"qlqly, rsj. ryq jyaayyslg",topics
1213,481,Name,qyrqq ryyyqyjqg,topics
1214,482,Name,jyqq rqyaalqqrj ysraaylg,topics
1215,483,Name,yyllyqrys vylqyr,topics
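Because each data frame keeps its own row labels, `reset_index()` stores the old labels in the `index` column seen above. If those labels are not needed, one option is to discard them during concatenation. The sketch below uses made-up frames; `first`, `second`, and their values are placeholders.

```python
import pandas as pd

first = pd.DataFrame({"value": ["a", "b"]}).assign(source="first")
second = pd.DataFrame({"value": ["c"]}).assign(source="second")

# Same pattern as the tutorial: the old row labels survive as an `index` column.
kept = pd.concat([first, second]).reset_index()

# ignore_index=True renumbers the rows from 0 and discards the old labels.
dropped = pd.concat([first, second], ignore_index=True)

print(kept.columns.tolist())
print(dropped.columns.tolist())
```

Keeping the old labels can still be useful if you want to trace a row back to its position in the original file.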


# Comparing Strings

### Some Background

I'll detail 2 methods of comparing strings. The first method uses the number of matches when aligning the strings to calculate the similarity score. The second method uses matching n-grams to calculate the similarity score.

Both methods calculate the score using the formula

$$S = \frac{2M}{T}$$

where $M$ is the number of matches and $T$ is the total number of elements being compared across both strings.

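In the first method, the sequences are aligned to find the number of matching characters.

For example, comparing the strings `their` and `there` gives the following alignment, where `|` marks a match, `d` a character deleted from the first string, and `i` a character inserted from the second:

```
their-
|||d|i
the-re
```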

We have 4 matches and 10 elements, so $S = \frac{2*4}{10} = 0.8$.

In this case, we can use the `ratio` method of `difflib`'s `SequenceMatcher` class to calculate the similarity score.

In [4]:
import difflib

string1 = 'their'
string2 = 'there'

difflib.SequenceMatcher(None, string1, string2).ratio()

0.8
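To check the arithmetic behind that result, the matching runs can be read from the matcher and plugged into $S = \frac{2M}{T}$ by hand. This is a small sketch using `get_matching_blocks`; the variable names `blocks`, `matches`, `total`, and `score` are my own.

```python
import difflib

string1 = "their"
string2 = "there"

matcher = difflib.SequenceMatcher(None, string1, string2)

# Each block is a run of matching characters; the final block is a
# zero-length sentinel, so it adds nothing to the sum.
blocks = matcher.get_matching_blocks()
matches = sum(block.size for block in blocks)

total = len(string1) + len(string2)
score = 2 * matches / total

print(matches, total, score)
```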

What if we compare something like `data selfie` to `selfie data`?

Although they consist of the same 2 words, the similarity score is only 0.545455. This is because the algorithm is trying to align the whole string, so it only captures `selfie` as a match:

```
data selfie-----
ddddd||||||iiiii
-----selfie data
```

$S = \frac{2M}{T} = \frac{2 * 6}{22} = \frac{12}{22} = 0.545455$

In [5]:
string3 = 'data selfie'
string4 = 'selfie data'
difflib.SequenceMatcher(None, string3, string4).ratio()

0.5454545454545454
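The alignment can also be inspected directly. As a sketch, `get_opcodes` lists the steps the matcher found for turning the first string into the second; the `tags` list is only there to summarize them.

```python
import difflib

string3 = "data selfie"
string4 = "selfie data"

matcher = difflib.SequenceMatcher(None, string3, string4)

# Each opcode says how a slice of string3 relates to a slice of string4.
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    print(tag, repr(string3[i1:i2]), repr(string4[j1:j2]))

tags = [opcode[0] for opcode in matcher.get_opcodes()]
```

Only the `equal` step contributes to the number of matches, which is why `data` is not counted even though it appears in both strings.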

This is where the second method, using matching n-grams to calculate similarity score, can help.

If we split the strings `data selfie` and `selfie data` into bigrams, we get:

```
{'da', 'at', 'ta', 'a ', ' s', 'se', 'el', 'lf', 'fi', 'ie'}

{'se', 'el', 'lf', 'fi', 'ie', 'e ', ' d', 'da', 'at', 'ta'}
```

We then calculate the similarity score as $\frac{2M}{T}$ where $M$ is the number of matching n-grams, and $T$ is the total number of n-grams.

$S = \frac{2M}{T} = \frac{2 * 8}{20} = \frac{16}{20} = 0.8$


In [6]:
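def ngrams(text, n):
  # Slide a window of length n across the string.
  output = []
  for i in range(len(text) - n + 1):
    output.append(text[i:i + n])
  return output

def count_matches(ngrams1, ngrams2):
  # Count the n-grams in the first list that also appear in the second.
  count = 0
  for gram in ngrams1:
    if gram in ngrams2:
      count += 1
  return count

def calculate_ngram_score(input1, input2, n):
  # S = 2M / T, where T is the total number of n-grams in both strings.
  ngrams1 = ngrams(input1, n)
  ngrams2 = ngrams(input2, n)
  return 2 * count_matches(ngrams1, ngrams2) / (len(ngrams1) + len(ngrams2))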

In [7]:
string5 = 'data selfie'
string6 = 'selfie data'

string5_bigrams = ngrams(string5, 2)
string6_bigrams = ngrams(string6, 2)

count = count_matches(string5_bigrams, string6_bigrams)

total = (len(string5_bigrams) + len(string6_bigrams))

score = count * 2 / total

print(string5, 'bigrams:', string5_bigrams)
print(string6, 'bigrams:', string6_bigrams)
print('matching bigrams:', count)
print('total bigrams:', total)
print('score:', score)


data selfie bigrams: ['da', 'at', 'ta', 'a ', ' s', 'se', 'el', 'lf', 'fi', 'ie']
selfie data bigrams: ['se', 'el', 'lf', 'fi', 'ie', 'e ', ' d', 'da', 'at', 'ta']
matching bigrams: 8
total bigrams: 20
score: 0.8


We can tweak our strings a little to increase the similarity. Since we know the words are separated by a space, we can pad the strings with a space on each end.

We now have `' data selfie '` and `' selfie data '` to create the following bigrams.

`{' d', 'da', 'at', 'ta', 'a ', ' s', 'se', 'el', 'lf', 'fi', 'ie', 'e '}`

`{' s', 'se', 'el', 'lf', 'fi', 'ie', 'e ', ' d', 'da', 'at', 'ta', 'a '}`

Since these sets contain all matching bigrams now, the similarity score is 1.

$S = \frac{2*12}{24} = 1.0$

In [8]:
calculate_ngram_score(' data selfie ', ' selfie data ', 2)

1.0
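One caveat: `count_matches` counts every n-gram in the first list that appears anywhere in the second, so strings with repeated n-grams can be over-counted. Comparing `'aaaa'` with `'aa'`, for instance, finds 3 matches among 4 bigrams and scores 1.5. A possible variant, sketched below, lets each occurrence match at most once by counting n-grams with `collections.Counter`. The name `ngram_score_multiset` is hypothetical and this is not the tutorial's implementation.

```python
from collections import Counter


def ngrams(text, n):
    """Return the list of overlapping n-grams in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_score_multiset(text1, text2, n):
    """Score 2M/T where each n-gram occurrence can match at most once."""
    counts1 = Counter(ngrams(text1, n))
    counts2 = Counter(ngrams(text2, n))
    total = sum(counts1.values()) + sum(counts2.values())
    if total == 0:
        # Both strings are shorter than n, so there is nothing to compare.
        return 0.0
    # Counter & Counter keeps the minimum count of each shared n-gram.
    matches = sum((counts1 & counts2).values())
    return 2 * matches / total


print(ngram_score_multiset("data selfie", "selfie data", 2))
print(ngram_score_multiset("aaaa", "aa", 2))
```

For strings without repeated n-grams, such as the `data selfie` example, this gives the same score as before.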

### Applications in Pandas

We can apply these similarity scoring methods to our data frame to add a new `similar` column that lists interests similar to the value in a row.

To do this, let's start by creating an array of all the `interests`. These are the values in the `value` column of our data frame.


In [9]:
interests = concatenated_interests.value.to_numpy()

interests

array(['qkaeaaeakq efqttaeakg', 'aeapdpgy gqqte', 'efqttaeakg', ...,
       'jyqq rqyaalqqrj ysraaylg', 'yyllyqrys vylqyr',
       'aayysqyrs yy yysql rlysqlyyaag'], dtype=object)

Now, we can use `difflib`'s `get_close_matches` function to compare a string to multiple strings (using the alignment matches scoring method).

To do this, let's define a function that compares an input to every other value in the interests array. `get_close_matches` takes an input string, a list of possibilities, a maximum number of items to return (`n`), and a minimum score to include (`cutoff`).
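Before applying it to the whole array, it may help to see how `n` and `cutoff` behave on a tiny list. The candidate words below are made up for illustration.

```python
import difflib

# Made-up candidate list; the tutorial uses the `interests` array instead.
candidates = ["ape", "apple", "peach", "puppy"]

# Defaults are n=3 and cutoff=0.6; results are ordered best match first.
default_matches = difflib.get_close_matches("appel", candidates)

# n caps how many matches come back.
best_only = difflib.get_close_matches("appel", candidates, n=1)

# A higher cutoff drops weaker matches.
strict = difflib.get_close_matches("appel", candidates, cutoff=0.9)

print(default_matches)
print(best_only)
print(strict)
```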

In [10]:
get_similar = lambda x: difflib.get_close_matches(x, interests[interests != x], n=5, cutoff=0.5)

We can now `apply` this function to the values in the `value` column and `assign` the output as a new `similar` column.

In [11]:
similar = concatenated_interests.value.apply(get_similar)
with_similar = concatenated_interests.assign(similar=similar)

with_similar

Unnamed: 0,index,type,value,source,similar
0,0,Interest,qkaeaaeakq efqttaeakg,ads_interests,"[tpgpg qkaeaaeakq efqttaeakg, qkaeaaeakq ptqqg..."
1,1,Interest,aeapdpgy gqqte,ads_interests,"[aeapgz gpzqoe, gaeazfpgt gqtgqge, fqaeaqgqqte..."
2,2,Interest,efqttaeakg,ads_interests,"[gqptaeakg, qppaeakg, qkaeaaeakq efqttaeakg, f..."
3,3,Interest,tfyeaeazpaea aaeapkqee,ads_interests,"[tfyeaeazpaea qdqgzaeaeq, zptaeappaea aaea, zf..."
4,4,Interest,tfyeaeazpaea qdqgzaeaeq,ads_interests,"[tfyeaeazpaea aaeapkqee, aeappgp aeaqgzaeaqg, ..."
...,...,...,...,...,...
1212,480,Name,"qlqly, rsj. ryq jyaayyslg",topics,"[qjrqqly rjyaayyslg, yaaylys ryq jyaayysl, qly..."
1213,481,Name,qyrqq ryyyqyjqg,topics,"[yyyqq ryyyqyjqg, qoylq ryyyqyjqg, jyyqq ryyyq..."
1214,482,Name,jyqq rqyaalqqrj ysraaylg,topics,"[yqq rsy aayqrj qyyyyg, yyyyqrj ysraaylq, jyql..."
1215,483,Name,yyllyqrys vylqyr,topics,"[yyllyqysq vylqyr, yyjqlyry vylqyr, lyqy vylqy..."


Now, let's do the same thing with n-gram similarity scoring.


In [12]:
def get_close_matches_ngram(text, possibilities, limit, n, cutoff, padding=False):
  # Collect the first `limit` possibilities whose score reaches the cutoff,
  # in the order they appear (results are not ranked by score).
  output = []
  if padding:
    text = ' ' + text + ' '
  for candidate in possibilities:
    if len(output) >= limit:
      break
    compared = ' ' + candidate + ' ' if padding else candidate
    if calculate_ngram_score(text, compared, n) >= cutoff:
      output.append(candidate)
  return output

As before, we define a function to get similar strings, `apply` it to the values in the `value` column, and `assign` the returned values to a new `similar` column.

In [13]:
get_similar_ngrams = lambda x: get_close_matches_ngram(x, interests[interests != x], 5, 2, 0.8)
similar_ngrams = concatenated_interests.value.apply(get_similar_ngrams)
with_similar_ngrams = concatenated_interests.assign(similar=similar_ngrams)

with_similar_ngrams

Unnamed: 0,index,type,value,source,similar
0,0,Interest,qkaeaaeakq efqttaeakg,ads_interests,"[efqttaeakg, tpgpg qkaeaaeakq efqttaeakg]"
1,1,Interest,aeapdpgy gqqte,ads_interests,[]
2,2,Interest,efqttaeakg,ads_interests,[]
3,3,Interest,tfyeaeazpaea aaeapkqee,ads_interests,[]
4,4,Interest,tfyeaeazpaea qdqgzaeaeq,ads_interests,[]
...,...,...,...,...,...
1212,480,Name,"qlqly, rsj. ryq jyaayyslg",topics,[]
1213,481,Name,qyrqq ryyyqyjqg,topics,"[yyyqq ryyyqyjqg, lyyyrsqq ryyyqyjqg, jyyqq ry..."
1214,482,Name,jyqq rqyaalqqrj ysraaylg,topics,[]
1215,483,Name,yyllyqrys vylqyr,topics,"[jyqsrqrys vylqyr, yyllyqysq vylqyr, lyyqrl vy..."
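Note that `get_close_matches_ngram` stops as soon as it has collected the requested number of candidates above the cutoff, so it returns the first qualifying matches rather than the best ones, unlike `difflib.get_close_matches`. If ranked results are preferred, one possible variant scores every candidate and keeps the top few. This is a sketch, not the tutorial's method: `top_ngram_matches`, `limit`, and the sample strings are hypothetical, and the scoring helper is repeated (with a guard for empty input) so the block runs on its own.

```python
import heapq


def ngrams(text, n):
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def calculate_ngram_score(text1, text2, n):
    # Same 2M/T scoring as the tutorial's helper, plus a guard for
    # the case where neither string yields any n-grams.
    ngrams1 = ngrams(text1, n)
    ngrams2 = ngrams(text2, n)
    total = len(ngrams1) + len(ngrams2)
    if total == 0:
        return 0.0
    matches = sum(1 for gram in ngrams1 if gram in ngrams2)
    return 2 * matches / total


def top_ngram_matches(text, possibilities, limit, n, cutoff):
    """Return up to `limit` candidates scoring >= cutoff, best first."""
    scored = [(calculate_ngram_score(text, p, n), p) for p in possibilities]
    passing = [pair for pair in scored if pair[0] >= cutoff]
    best = heapq.nlargest(limit, passing, key=lambda pair: pair[0])
    return [candidate for _, candidate in best]


candidates = ["selfie", "selfie data", "data shelf", "dataset"]
print(top_ngram_matches("data selfie", candidates, 2, 2, 0.5))
```

The trade-off is speed: every candidate is scored for every row, whereas the early-exit version can stop after a handful of comparisons.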


We can get rid of the rows with no similar matches by filtering the data frame to only include items where the length of the `similar` value is not 0.

In [14]:
filtered_similar = with_similar_ngrams[with_similar_ngrams['similar'].map(len) != 0]

filtered_similar

Unnamed: 0,index,type,value,source,similar
0,0,Interest,qkaeaaeakq efqttaeakg,ads_interests,"[efqttaeakg, tpgpg qkaeaaeakq efqttaeakg]"
5,5,Interest,apaeaaeaaeay,ads_interests,"[kqpaaeaaead, etqpaeaay, zpyapaeag, pgpaeaaeap..."
10,10,Interest,zqaeaqty aeaqqaeaqe,ads_interests,"[qaeatqq gpaeaqe, fqaeaqgqqte, epqqaeaeqgaeaqe..."
11,11,Interest,pzpaeaqk aeaqqaeaqe,ads_interests,"[qaeatqq gpaeaqe, qtpzppaeaqk, aeaqqaeaqe, zpq..."
12,12,Interest,oqqre,ads_interests,[q-oqqre]
...,...,...,...,...,...
1208,476,Name,aarqqrqqryyr ryq qqyqlg,topics,[qlyyqry ryq qqyqlg]
1210,478,Name,lqyjyqrys rsqyqqyl,topics,[lyqorys rsqyqqyl]
1213,481,Name,qyrqq ryyyqyjqg,topics,"[yyyqq ryyyqyjqg, lyyyrsqq ryyyqyjqg, jyyqq ry..."
1215,483,Name,yyllyqrys vylqyr,topics,"[jyqsrqrys vylqyr, yyllyqysq vylqyr, lyyqrl vy..."
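The filter works because the comparison produces a boolean mask with one entry per row. A small sketch on a made-up frame (the values are placeholders) shows an equivalent mask and why the surviving rows keep their original labels, as in the output above where row 5 follows row 0.

```python
import pandas as pd

# Made-up frame with a list-valued column, like `similar` in the tutorial.
df = pd.DataFrame(
    {"value": ["a", "b", "c"], "similar": [["x", "y"], [], ["z"]]}
)

# An empty list is falsy, so converting each list to bool builds the mask.
has_matches = df["similar"].map(bool)
filtered = df[has_matches]

print(has_matches.tolist())
print(filtered)
```

If consecutive row numbers are wanted after filtering, `reset_index(drop=True)` can be chained on the result.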


We can also apply padding to our strings when comparing n-grams by passing `True` as the final `padding` argument.

In [15]:
get_similar_ngrams_padded = lambda x: get_close_matches_ngram(x, interests[interests != x], 5, 2, 0.8, True)
similar_ngrams_padded = concatenated_interests.value.apply(get_similar_ngrams_padded)
with_similar_ngrams_padded = concatenated_interests.assign(similar=similar_ngrams_padded)
filtered_similar_padded = with_similar_ngrams_padded[with_similar_ngrams_padded['similar'].map(len) != 0]

filtered_similar_padded

Unnamed: 0,index,type,value,source,similar
0,0,Interest,qkaeaaeakq efqttaeakg,ads_interests,"[efqttaeakg, tpgpg qkaeaaeakq efqttaeakg]"
5,5,Interest,apaeaaeaaeay,ads_interests,"[etqpaeaay, pgpaeaaeap, apoaaeapapk, fpaeaaea,..."
6,6,Interest,faeat fqt aeapeaeaz,ads_interests,"[aeapeaeaz, fqp pqtaeaz]"
10,10,Interest,zqaeaqty aeaqqaeaqe,ads_interests,"[fqaeaqgqqte, aeakeppkp zqaaqq, aeaqqaeaqe, ae..."
11,11,Interest,pzpaeaqk aeaqqaeaqe,ads_interests,"[qtpzppaeaqk, aeaqqaeaqe, gpaeaqe fpgtqk, aeaq..."
...,...,...,...,...,...
1209,477,Name,yyyyqrj ysraaylq,topics,[yyraayqlq]
1210,478,Name,lqyjyqrys rsqyqqyl,topics,[lyqorys rsqyqqyl]
1213,481,Name,qyrqq ryyyqyjqg,topics,"[qoylq ryyyqyjqg, yyyqq ryyyqyjqg, lyyyrsqq ry..."
1215,483,Name,yyllyqrys vylqyr,topics,[yyllyqysq vylqyr]
