
# Dataset Generation

Upload the dataset to your Google Drive.

Dataset: https://www.kaggle.com/datasets/fabiochiusano/medium-articles

In [None]:
!pip install "openai<1.0"

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import openai

Loading the dataset

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Newron/medium_articles.csv')

In [None]:
df.head(2)

In [None]:
wanted_tags = ['AI', 'Artificial Intelligence', 'Machine Learning', 'Data Science']

In [None]:
def keep_or_delete(row):
    # Substring match against the stringified tag list in the 'tags' column.
    for tag in wanted_tags:
        if tag in row['tags']:
            return 'keep'
    return 'delete'

In [None]:
df['keep_or_delete'] = df.apply(keep_or_delete, axis=1)
ml_df = df.loc[df['keep_or_delete'] == 'keep']
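
Because the check above is a substring match on the whole `tags` string, a wanted tag such as `'AI'` also matches longer tags that merely contain it (for example `'OpenAI'`). The following is a minimal sketch of an exact-match alternative, assuming the `tags` column holds the string form of a Python list; `has_wanted_tag` and the example strings are hypothetical and not part of the original notebook.

```python
import ast

wanted_tags = {'AI', 'Artificial Intelligence', 'Machine Learning', 'Data Science'}

def has_wanted_tag(tags_str):
    """Return True if any tag in the stringified list is exactly a wanted tag."""
    tags = ast.literal_eval(tags_str)  # "['A', 'B']" -> ['A', 'B']
    return any(tag in wanted_tags for tag in tags)

# Hypothetical values standing in for the 'tags' column.
examples = ["['OpenAI', 'Startup']", "['Machine Learning']", "['AI', 'Ethics']"]
print([has_wanted_tag(s) for s in examples])  # [False, True, True]
```

With pandas, this could be applied as `df.loc[df['tags'].apply(has_wanted_tag)]`. Note that it would select a somewhat different set of articles than the substring check.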

Keeping only articles with fewer than 1024 words

In [None]:
def wordlen(row):
  return len(row['text'].split(" "))
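
Splitting on a single space counts words only approximately: newlines do not separate words, and consecutive spaces produce empty strings that are counted as words. The short comparison below illustrates the difference on two made-up strings; `wordlen_space` and `wordlen_whitespace` are hypothetical names, and which behaviour is preferable depends on how strict the 1024-word limit needs to be.

```python
def wordlen_space(text):
    """Count pieces separated by a single space (the approach used above)."""
    return len(text.split(" "))

def wordlen_whitespace(text):
    """Count pieces separated by any run of whitespace, including newlines."""
    return len(text.split())

paragraphs = "Intro\n\nTwo words here"   # newlines separate 'Intro' and 'Two'
double_space = "a  b"                    # two spaces between the words

print(wordlen_space(paragraphs), wordlen_whitespace(paragraphs))      # 3 4
print(wordlen_space(double_space), wordlen_whitespace(double_space))  # 3 2
```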

In [None]:
# Work on an independent copy so the column assignment does not trigger SettingWithCopyWarning.
ml_df = ml_df.copy()
ml_df['wordlen'] = ml_df.apply(wordlen, axis=1)
ml_df = ml_df.loc[ml_df['wordlen'] < 1024]


Saving the 7000+ ML article dataset for future use.

In [None]:
ml_df.to_csv('Medium_ML_artciles_7k_1000words.csv', header=False, index=False)

## Sampling 1000 articles and generating specific tags with the GPT3 API

In [None]:
final_sample_df = ml_df.sample(n=1000)

In [None]:
OPENAI_API_KEY="sk-XXXX_INSERT_YOUR_KEY_XXXX"

In [None]:
openai.api_key = OPENAI_API_KEY
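
Hard-coding the key in a cell makes it easy to leak when the notebook is shared. One common alternative is to read it from an environment variable. The helper below is a sketch under that assumption; `load_api_key` is a hypothetical name, and `OPENAI_API_KEY` is assumed to be the variable you set in your own environment.

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable, or fail loudly."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running this notebook.")
    return key
```

The result could then be assigned with `openai.api_key = load_api_key()`.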

In [None]:
promptHead = "Generate exactly 5 tags related to the domain for this article:"

In [None]:
def tag_generation(text, max_attempts=3):
    """Ask GPT3 for tags; return a sentinel string if every attempt fails."""
    prompt = "\n".join([promptHead, text])
    for _ in range(max_attempts):
        try:
            response = openai.Completion.create(
                model="text-davinci-003",
                prompt=prompt,
                temperature=0.5,
                max_tokens=60,
                top_p=1,
                frequency_penalty=0.8,
                presence_penalty=0
            )
            # Drop newlines, split on commas, keep the first 5 items and strip '#'.
            raw_tags = response.choices[0].text.replace("\n", "").split(",")[:5]
            return ", ".join(tag.replace("#", "") for tag in raw_tags).strip()
        except Exception:
            # Any API error (rate limit, timeout, ...) triggers another attempt.
            continue
    # All attempts failed.
    return "GPT3_API_FAILED"
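
The function above retries immediately after a failure, which tends not to help when the failure is a rate limit. A generic retry helper with exponential backoff might look like the sketch below. `call_with_retries`, its parameters, and the `flaky` stand-in are all hypothetical; no real API call is made here.

```python
import time

def call_with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() up to max_attempts times, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                # Same sentinel as the notebook uses when all attempts fail.
                return "GPT3_API_FAILED"
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Stand-in for an API call that fails twice and then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "Data Analysis, Automation"

# Pass a no-op sleep so the demo does not actually wait.
result = call_with_retries(flaky, sleep=lambda seconds: None)
print(result)  # Data Analysis, Automation
```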



In [None]:
sample_text = """
A quick and simple explanation of a very important issue.

In order to innovate, business analysts, doctors, and researchers need to ask iterative, detailed questions about their customers, patients, and experiments, respectively — but analytics dashboards are only giving them the same old query tools, KPIs and visualizations.

It’s not working.

Hence, those folks turn to data specialists for help: i.e. statisticians, data scientists, database administrators.

Data specialists are akin to Sherpas like Tenzing Norgay pictured above — they alone know the terrain, and they alone have the skills to get you up and down the mountain (of data).

The first advantage of asking a human data specialist for help is that you can effectively communicate your question, in your language to them. Maybe it takes a few back-and-forth emails, maybe it takes an in-person meeting. But afterwards you can (usually) trust they’ll produce an answer with value. The second advantage is that the data specialists do all the work of writing code, queries, data blending and computation — all you need to do is wait. To many folks without data skills, this part can feel like magic. The third advantage is that data specialists can explain the analysis results back to you, in your language, helping you understand if and how the results answer your question.

The crazy thing about the process is how long it takes. Just writing the first email or scheduling the first meeting with a data specialist takes minutes to hours. Then there’s all the time they have to spend gathering data, slicing it, invoking algorithms, and preparing results. This usually takes a data specialist days to weeks, sometimes months. It becomes impossible to iterate through a series of follow-up questions, because it takes too damn long just to get the first answer.

This is the Last Mile of data analysis, and it’s baffling why people take this human-dependent bottleneck for granted. There’s no magic in the process — it can be codified, automated, and accelerated via software. Getting an answer to a complex question should be faster than the time it takes to send an email or schedule a meeting with a data specialist.

We have machines to automatically climb mountains (albeit smaller ones than Everest).

How can we solve this Last Mile problem with software? Look to the three advantages of human data specialists described above:

The system must allow the end user to ask a complex data question in their own language. This doesn’t necessarily mean using natural language processing like Siri or Alexa. “Speaking their own language” is more often solved by good User Experience. Guide the user into a point-and-click workflow where they understand the topic, parameters, and actions available. The system must be able to do all the sophisticated+messy work necessary to answer the question, quickly. This means proper integration and slicing of data, and utilizing proper algorithms-statistics-machine learning. In other words — integrating the right data and tools for each successive question that arises, quickly and at scale. The system must be able to explain data analysis results back to the user in their own language. User Experience, again, but with less insistence that visualization can do it all — contrary to conventional wisdom, visualization is often useless — and instead more dynamic text that describes each result clearly and natively in the context of the question asked (i.e. automated and concise figure captions).

I had hoped to keep this short, so I’m just going to leave it at that.

If you’d like to know more about how Tag.bio solves this problem in business, healthcare and life sciences, drop me a line.
"""

In [None]:
tag_generation(sample_text)

'Data Analysis,  Data Specialists,  Automation,  Last Mile Problem,  User Experience'
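
The doubled spaces in this output come from joining the comma-separated pieces without stripping them first. A slightly stricter parser could normalise whitespace and drop empty pieces, as sketched below. `parse_tags` is a hypothetical helper, and `raw` is a made-up completion string rather than a real API response.

```python
def parse_tags(raw_completion, max_tags=5):
    """Turn a raw comma-separated completion into a clean ', '-joined tag string."""
    # Replace newlines with spaces so words on adjacent lines are not fused together.
    parts = raw_completion.replace("\n", " ").split(",")
    tags = [part.replace("#", "").strip() for part in parts]
    tags = [tag for tag in tags if tag]  # drop empty pieces
    return ", ".join(tags[:max_tags])

# Made-up completion text in the style the model tends to return.
raw = "\n\n#Data Analysis, #Data Specialists, #Automation, #Last Mile Problem, #User Experience"
print(parse_tags(raw))
# Data Analysis, Data Specialists, Automation, Last Mile Problem, User Experience
```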

In [None]:
def gpt3_tags(row):
  tags = tag_generation(row['text'])
  return tags

In [None]:
final_sample_df['specific_tags'] = final_sample_df.apply(gpt3_tags, axis=1)
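
This single `apply` issues one API call per row, so an interruption partway through loses everything generated so far. One way to make the run resumable is to persist each result as it arrives. The sketch below uses only the standard library; `tag_with_cache`, the JSON cache file, and `fake_tagger` are assumptions for illustration, not part of the original notebook.

```python
import json
import os
import tempfile

def tag_with_cache(texts, tagger, cache_path):
    """Tag each text once, saving results to disk so an interrupted run can resume."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            cache = json.load(fh)
    for key, text in texts.items():
        key = str(key)  # JSON object keys are always strings
        if key in cache:
            continue  # already tagged in an earlier run
        cache[key] = tagger(text)
        with open(cache_path, "w") as fh:
            json.dump(cache, fh)  # rewrite the cache after every new result
    return cache

# Stand-in for tag_generation that records how often it is called.
calls = []

def fake_tagger(text):
    calls.append(text)
    return text.upper()

path = os.path.join(tempfile.mkdtemp(), "tags_cache.json")
texts = {"11247": "dash", "163374": "ruptures"}
first = tag_with_cache(texts, fake_tagger, path)
second = tag_with_cache(texts, fake_tagger, path)  # served entirely from the cache
print(first == second, len(calls))  # True 2
```

Rewriting the whole file after every result is wasteful for large inputs, but for around 1000 rows it keeps the sketch simple.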

In [None]:
final_sample_df

Unnamed: 0,title,text,url,authors,timestamp,tags,keep_or_delete,wordlen,specific_tags
11247,How to Build a Reporting Dashboard using Dash ...,A method to select either a condensed data tab...,https://medium.com/p/4f4257c18a7f#58a3,['David Comfort'],2019-03-13 14:21:44.055000+00:00,"['Dash', 'Dashboard', 'Data Science', 'Data Vi...",keep,668,"data table, radio button, Dash data table, ..."
163374,A Brief Introduction to Change Point Detection...,The ruptures Package\n\nCharles Truong adapted...,https://towardsdatascience.com/a-brief-introdu...,['Kirsten Perry'],2019-08-15 11:16:23.649000+00:00,"['Programming', 'Time Series Analysis', 'Stati...",keep,757,"ruptures, changepoint detection, Python, PE..."
37600,Demystifying Tensorflow Estimator,Tensorflow is one of the most popular machine ...,https://medium.com/analytics-vidhya/demystifyi...,['Junaid Khattak'],2020-10-31 13:27:17.250000+00:00,"['Machine Learning', 'Tensorflow2', 'Tensorboa...",keep,540,"tags: Tensorflow, Machine Learning, Estimato..."
37858,PixieDust gets its first community-driven feat...,PixieDust gets its first community-driven feat...,https://medium.com/codait/pixiedust-gets-its-f...,['David Taieb'],2017-04-13 19:48:39.848000+00:00,"['Apache Spark', 'Data Visualization', 'Data S...",keep,194,"PixieDust, PyPi, JupyterNotebooks, PySpark,..."
2642,5 Useful Image Manipulation Techniques Using P...,Introduction\n\nAlthough many programmers don’...,https://medium.com/better-programming/5-useful...,['Yong Cui'],2020-10-01 15:04:50.742000+00:00,"['Machine Learning', 'Python3', 'Artificial In...",keep,440,"Tags: Image Manipulation, OpenCV, Numpy Arra..."
...,...,...,...,...,...,...,...,...,...
8218,The Missing List of JupyterLab Keyboard Shortcuts,"With keyboard shortcuts, you can whiz around J...",https://towardsdatascience.com/the-missing-lis...,['Jeff Hale'],2020-10-16 21:45:19.823000+00:00,"['Machine Learning', 'Data Science', 'Technolo...",keep,102,"jupyterlab, keyboardshortcuts, python, sql,..."
116252,Image Classification: Cats and Dogs — Pre-trai...,Chapter 4 of Neural Network Projects with Pyth...,https://medium.com/analytics-vidhya/image-clas...,['Mark Subra'],2020-10-20 12:51:26.151000+00:00,"['Image Classification', 'Neural Networks', 'D...",keep,688,"Tags: Convolutional Neural Network, CNN, Ima..."
37687,Functional Safety Concept for Self Driving Cars,Functional Safety Requirements\n\nGoing back t...,https://medium.com/swlh/functional-safety-conc...,['Prateek Sawhney'],2020-12-05 19:19:53.542000+00:00,"['Artificial Intelligence', 'Machine Learning'...",keep,205,"FunctionalSafety, LaneDepartureWarning, MaxT..."
91189,RL World,RL World\n\nReinforcement Learning is a very v...,https://medium.com/@adabhishekdabas/rl-world-3...,['Abhishek Dabas'],2020-12-19 22:17:16.490000+00:00,"['Data Science', 'OpenAI', 'Artificial Intelli...",keep,255,contextual and non-contextual.ReinforcementLea...


Saving the GPT3-generated tags for post-processing

In [None]:
final_sample_df.to_csv('Medium_ML_SpecificTags_1000articles_1000words.csv', header=True, index=False)

In [None]:
final_sample_df['specific_tags']

11247     data table,  radio button,  Dash data table,  ...
163374    ruptures,  changepoint detection,  Python,  PE...
37600     tags: Tensorflow,  Machine Learning,  Estimato...
37858     PixieDust,  PyPi,  JupyterNotebooks,  PySpark,...
2642      Tags: Image Manipulation,  OpenCV,  Numpy Arra...
                                ...                        
8218      jupyterlab,  keyboardshortcuts,  python,  sql,...
116252    Tags: Convolutional Neural Network,  CNN,  Ima...
91189     contextual and non-contextual.ReinforcementLea...
18946     You could find more details about this feature...
Name: specific_tags, Length: 1000, dtype: object

## GPT3 Tag repair


In [None]:
final_sample_df

Unnamed: 0,title,text,url,authors,timestamp,tags,keep_or_delete,wordlen,specific_tags
0,How to Build a Reporting Dashboard using Dash ...,A method to select either a condensed data tab...,https://medium.com/p/4f4257c18a7f#58a3,['David Comfort'],2019-03-13 14:21:44.055000+00:00,"['Dash', 'Dashboard', 'Data Science', 'Data Vi...",keep,668,"data table, radio button, Dash data table, ..."
1,A Brief Introduction to Change Point Detection...,The ruptures Package\n\nCharles Truong adapted...,https://towardsdatascience.com/a-brief-introdu...,['Kirsten Perry'],2019-08-15 11:16:23.649000+00:00,"['Programming', 'Time Series Analysis', 'Stati...",keep,757,"ruptures, changepoint detection, Python, PE..."
2,Demystifying Tensorflow Estimator,Tensorflow is one of the most popular machine ...,https://medium.com/analytics-vidhya/demystifyi...,['Junaid Khattak'],2020-10-31 13:27:17.250000+00:00,"['Machine Learning', 'Tensorflow2', 'Tensorboa...",keep,540,"tags: Tensorflow, Machine Learning, Estimato..."
3,PixieDust gets its first community-driven feat...,PixieDust gets its first community-driven feat...,https://medium.com/codait/pixiedust-gets-its-f...,['David Taieb'],2017-04-13 19:48:39.848000+00:00,"['Apache Spark', 'Data Visualization', 'Data S...",keep,194,"PixieDust, PyPi, JupyterNotebooks, PySpark,..."
4,5 Useful Image Manipulation Techniques Using P...,Introduction\n\nAlthough many programmers don’...,https://medium.com/better-programming/5-useful...,['Yong Cui'],2020-10-01 15:04:50.742000+00:00,"['Machine Learning', 'Python3', 'Artificial In...",keep,440,"Tags: Image Manipulation, OpenCV, Numpy Arra..."
...,...,...,...,...,...,...,...,...,...
995,The Missing List of JupyterLab Keyboard Shortcuts,"With keyboard shortcuts, you can whiz around J...",https://towardsdatascience.com/the-missing-lis...,['Jeff Hale'],2020-10-16 21:45:19.823000+00:00,"['Machine Learning', 'Data Science', 'Technolo...",keep,102,"jupyterlab, keyboardshortcuts, python, sql,..."
996,Image Classification: Cats and Dogs — Pre-trai...,Chapter 4 of Neural Network Projects with Pyth...,https://medium.com/analytics-vidhya/image-clas...,['Mark Subra'],2020-10-20 12:51:26.151000+00:00,"['Image Classification', 'Neural Networks', 'D...",keep,688,"Tags: Convolutional Neural Network, CNN, Ima..."
997,Functional Safety Concept for Self Driving Cars,Functional Safety Requirements\n\nGoing back t...,https://medium.com/swlh/functional-safety-conc...,['Prateek Sawhney'],2020-12-05 19:19:53.542000+00:00,"['Artificial Intelligence', 'Machine Learning'...",keep,205,"FunctionalSafety, LaneDepartureWarning, MaxT..."
998,RL World,RL World\n\nReinforcement Learning is a very v...,https://medium.com/@adabhishekdabas/rl-world-3...,['Abhishek Dabas'],2020-12-19 22:17:16.490000+00:00,"['Data Science', 'OpenAI', 'Artificial Intelli...",keep,255,contextual and non-contextual.ReinforcementLea...


In [None]:
def gpt3_tags_repair(row):
  '''
  Clean up the raw GPT3 output stored in `specific_tags`.
  Some responses start with "Tags:" or "tags:"; others are random
  sentences rather than tags and are marked "rejected".
  '''
  tag_string = row['specific_tags']
  # Drop a leading "tags:" / "Tags:" label if present.
  if "tags:" in tag_string:
    tag_string = tag_string.split("tags:")[1]
  elif "Tags:" in tag_string:
    tag_string = tag_string.split("Tags:")[1]
  tags = [tag.strip() for tag in tag_string.split(",") if len(tag.strip()) > 0]

  # Accept only rows with 4 or 5 tags.
  if len(tags) not in (4, 5):
    print(tags)
    return "rejected"

  for count, tag in enumerate(tags):
    if len(tag) < 2:
      continue  # nothing to split; also avoids duplicating a one-character tag
    # Split fused words, e.g. AbcdeFghi --> Abcde Fghi: insert a space before
    # an uppercase letter that sits between two lowercase letters.
    temp_tag = tag[0]
    corrected = False
    for i in range(1, len(tag) - 1):
      if tag[i].isupper() and tag[i-1].islower() and tag[i+1].islower():
        corrected = True
        temp_tag += " "
      temp_tag += tag[i]
    temp_tag += tag[-1]
    tags[count] = temp_tag
    if corrected:
      print("Corrected: ", tags[count])
  return ", ".join(tags)
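
The word-splitting rule in the loop above (a space before an uppercase letter that sits between two lowercase letters) can also be written as a single regular expression. This is a sketch rather than a drop-in replacement: `split_fused_words` is a hypothetical name, and the pattern only handles ASCII letters, whereas `isupper()` and `islower()` are Unicode-aware.

```python
import re

# Zero-width position preceded by a lowercase letter and followed by
# an uppercase letter that is itself followed by a lowercase letter.
_FUSED_WORD_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z][a-z])")

def split_fused_words(tag):
    """Insert a space at every fused-word boundary, e.g. 'PixieDust' -> 'Pixie Dust'."""
    return _FUSED_WORD_BOUNDARY.sub(" ", tag)

for tag in ["PixieDust", "JupyterNotebooks", "OpenCV", "FunctionalSafety"]:
    print(tag, "->", split_fused_words(tag))
# PixieDust -> Pixie Dust
# JupyterNotebooks -> Jupyter Notebooks
# OpenCV -> OpenCV
# FunctionalSafety -> Functional Safety
```

As in the loop, acronyms such as `OpenCV` are left alone because the uppercase letter is not followed by a lowercase one, while names such as `PyPi` are still split into `Py Pi`.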

In [None]:
final_sample_df['corrected_tags'] = final_sample_df.apply(gpt3_tags_repair, axis=1)

In [None]:
final_sample_df

Unnamed: 0,title,text,url,authors,timestamp,tags,keep_or_delete,wordlen,specific_tags,corrected_tags
0,How to Build a Reporting Dashboard using Dash ...,A method to select either a condensed data tab...,https://medium.com/p/4f4257c18a7f#58a3,['David Comfort'],2019-03-13 14:21:44.055000+00:00,"['Dash', 'Dashboard', 'Data Science', 'Data Vi...",keep,668,"data table, radio button, Dash data table, ...","data table, radio button, Dash data table, dop..."
1,A Brief Introduction to Change Point Detection...,The ruptures Package\n\nCharles Truong adapted...,https://towardsdatascience.com/a-brief-introdu...,['Kirsten Perry'],2019-08-15 11:16:23.649000+00:00,"['Programming', 'Time Series Analysis', 'Stati...",keep,757,"ruptures, changepoint detection, Python, PE...","ruptures, changepoint detection, Python, PELT,..."
2,Demystifying Tensorflow Estimator,Tensorflow is one of the most popular machine ...,https://medium.com/analytics-vidhya/demystifyi...,['Junaid Khattak'],2020-10-31 13:27:17.250000+00:00,"['Machine Learning', 'Tensorflow2', 'Tensorboa...",keep,540,"tags: Tensorflow, Machine Learning, Estimato...","Tensorflow, Machine Learning, Estimator API, K..."
3,PixieDust gets its first community-driven feat...,PixieDust gets its first community-driven feat...,https://medium.com/codait/pixiedust-gets-its-f...,['David Taieb'],2017-04-13 19:48:39.848000+00:00,"['Apache Spark', 'Data Visualization', 'Data S...",keep,194,"PixieDust, PyPi, JupyterNotebooks, PySpark,...","Pixie Dust, Py Pi, Jupyter Notebooks, Py Spark..."
4,5 Useful Image Manipulation Techniques Using P...,Introduction\n\nAlthough many programmers don’...,https://medium.com/better-programming/5-useful...,['Yong Cui'],2020-10-01 15:04:50.742000+00:00,"['Machine Learning', 'Python3', 'Artificial In...",keep,440,"Tags: Image Manipulation, OpenCV, Numpy Arra...","Image Manipulation, OpenCV, Numpy Array, BGR S..."
...,...,...,...,...,...,...,...,...,...,...
995,The Missing List of JupyterLab Keyboard Shortcuts,"With keyboard shortcuts, you can whiz around J...",https://towardsdatascience.com/the-missing-lis...,['Jeff Hale'],2020-10-16 21:45:19.823000+00:00,"['Machine Learning', 'Data Science', 'Technolo...",keep,102,"jupyterlab, keyboardshortcuts, python, sql,...","jupyterlab, keyboardshortcuts, python, sql, do..."
996,Image Classification: Cats and Dogs — Pre-trai...,Chapter 4 of Neural Network Projects with Pyth...,https://medium.com/analytics-vidhya/image-clas...,['Mark Subra'],2020-10-20 12:51:26.151000+00:00,"['Image Classification', 'Neural Networks', 'D...",keep,688,"Tags: Convolutional Neural Network, CNN, Ima...","Convolutional Neural Network, CNN, Image Class..."
997,Functional Safety Concept for Self Driving Cars,Functional Safety Requirements\n\nGoing back t...,https://medium.com/swlh/functional-safety-conc...,['Prateek Sawhney'],2020-12-05 19:19:53.542000+00:00,"['Artificial Intelligence', 'Machine Learning'...",keep,205,"FunctionalSafety, LaneDepartureWarning, MaxT...","Functional Safety, Lane Departure Warning, Max..."
998,RL World,RL World\n\nReinforcement Learning is a very v...,https://medium.com/@adabhishekdabas/rl-world-3...,['Abhishek Dabas'],2020-12-19 22:17:16.490000+00:00,"['Data Science', 'OpenAI', 'Artificial Intelli...",keep,255,contextual and non-contextual.ReinforcementLea...,contextual and non-contextual.Reinforcement Le...


In [None]:
# Removing flag columns and intermediate state columns of tags.
final_sample_df = final_sample_df.drop(['tags','keep_or_delete', 'specific_tags'], axis=1)

# Removing rows whose tags were rejected
final_sample_df = final_sample_df.loc[final_sample_df['corrected_tags'] != 'rejected']

In [None]:
final_sample_df

Unnamed: 0,title,text,url,authors,timestamp,wordlen,corrected_tags
0,How to Build a Reporting Dashboard using Dash ...,A method to select either a condensed data tab...,https://medium.com/p/4f4257c18a7f#58a3,['David Comfort'],2019-03-13 14:21:44.055000+00:00,668,"data table, radio button, Dash data table, dop..."
1,A Brief Introduction to Change Point Detection...,The ruptures Package\n\nCharles Truong adapted...,https://towardsdatascience.com/a-brief-introdu...,['Kirsten Perry'],2019-08-15 11:16:23.649000+00:00,757,"ruptures, changepoint detection, Python, PELT,..."
2,Demystifying Tensorflow Estimator,Tensorflow is one of the most popular machine ...,https://medium.com/analytics-vidhya/demystifyi...,['Junaid Khattak'],2020-10-31 13:27:17.250000+00:00,540,"Tensorflow, Machine Learning, Estimator API, K..."
3,PixieDust gets its first community-driven feat...,PixieDust gets its first community-driven feat...,https://medium.com/codait/pixiedust-gets-its-f...,['David Taieb'],2017-04-13 19:48:39.848000+00:00,194,"Pixie Dust, Py Pi, Jupyter Notebooks, Py Spark..."
4,5 Useful Image Manipulation Techniques Using P...,Introduction\n\nAlthough many programmers don’...,https://medium.com/better-programming/5-useful...,['Yong Cui'],2020-10-01 15:04:50.742000+00:00,440,"Image Manipulation, OpenCV, Numpy Array, BGR S..."
...,...,...,...,...,...,...,...
995,The Missing List of JupyterLab Keyboard Shortcuts,"With keyboard shortcuts, you can whiz around J...",https://towardsdatascience.com/the-missing-lis...,['Jeff Hale'],2020-10-16 21:45:19.823000+00:00,102,"jupyterlab, keyboardshortcuts, python, sql, do..."
996,Image Classification: Cats and Dogs — Pre-trai...,Chapter 4 of Neural Network Projects with Pyth...,https://medium.com/analytics-vidhya/image-clas...,['Mark Subra'],2020-10-20 12:51:26.151000+00:00,688,"Convolutional Neural Network, CNN, Image Class..."
997,Functional Safety Concept for Self Driving Cars,Functional Safety Requirements\n\nGoing back t...,https://medium.com/swlh/functional-safety-conc...,['Prateek Sawhney'],2020-12-05 19:19:53.542000+00:00,205,"Functional Safety, Lane Departure Warning, Max..."
998,RL World,RL World\n\nReinforcement Learning is a very v...,https://medium.com/@adabhishekdabas/rl-world-3...,['Abhishek Dabas'],2020-12-19 22:17:16.490000+00:00,255,contextual and non-contextual.Reinforcement Le...


## Saving the final dataset.

In [None]:
final_sample_df.to_csv('Medium_ML_Specific_Refined_Tags_940articles_1000words.csv', header=True, index=False)
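
When the saved file is read back, the `corrected_tags` column is a single comma-separated string per article. The sketch below shows one way to turn it into lists, using the standard `csv` module on a small in-memory example. The two rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io

# In-memory stand-in for the saved CSV; both rows are made up.
toy_csv = io.StringIO(
    'title,corrected_tags\n'
    'Example article,"Pixie Dust, Py Spark, Jupyter Notebooks, Data Science"\n'
    'Another article,rejected\n'
)

# Skip any row still marked as rejected, then split the tag string into a list.
rows = [row for row in csv.DictReader(toy_csv) if row['corrected_tags'] != 'rejected']
tag_lists = [[tag.strip() for tag in row['corrected_tags'].split(',')] for row in rows]
print(tag_lists)
# [['Pixie Dust', 'Py Spark', 'Jupyter Notebooks', 'Data Science']]
```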