# Data Generation and Preparation

Here we'll gather the data we want to fine-tune on and prepare it so that a model can ingest it.

In [18]:
# Imports
import pandas as pd
import os
from google.cloud import bigquery

# Option
pd.set_option('display.max_colwidth', 500)

### Data Generation

As with many Gen AI models, we want to be able to ask a question and get an answer back. One of the best repositories for a large collection of questions and answers is the online Programming Language Library of Alexandria, otherwise known as Stack Overflow.

There are several ways to get Stack Overflow data:
- Use an API to query Stack Overflow directly
- Download a publicly available data dump and query from that
- Query the data from Google BigQuery

Were this a long-term project, creating a framework to collect data from Stack Overflow's APIs would be best. However, as this is an exploratory project, we'll query the data from BigQuery, since this offers the best tradeoff between simplicity and speed.

---

##### Credentials

In order to access Google BigQuery, we need to create an account and create access credentials for our queries to work. While this does require a Google Cloud account, the volume of the query is low enough that there are no charges.
- [Account Creation](https://cloud.google.com/)
- [Project Creation](https://cloud.google.com/resource-manager/docs/creating-managing-projects)
- [Credential Creation](https://developers.google.com/workspace/guides/create-credentials)

For this repository to run:
1.  Create a folder titled `credentials` one directory above your repository
2.  Save the JSON file created by Google Cloud in that folder
3.  Rename the file to `gen-ai-hackathon.json`

In [2]:
# Set the environment variable that tells the BigQuery client where to find the credentials file
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '../credentials/gen-ai-hackathon.json'
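
If the credentials file is missing or misnamed, the failure only shows up later when the query runs. As a minimal sketch, the cell above could be wrapped in a small check that fails early. The helper name `set_credentials` is hypothetical and not part of this repository.

```python
import os
from pathlib import Path


def set_credentials(path):
    """Point GOOGLE_APPLICATION_CREDENTIALS at `path`, failing early if the file is missing.

    `set_credentials` is a hypothetical helper used for illustration only.
    """
    key_file = Path(path)
    if not key_file.is_file():
        raise FileNotFoundError(f"No service account key found at {key_file}")
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = str(key_file)
    return key_file
```

In this notebook it would be called as `set_credentials('../credentials/gen-ai-hackathon.json')`.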

#### Query
Here we write the SQL that queries the data from BigQuery. The ultimate goal is to create a set of question/answer pairs that we can further process into a corpus and tokenize for the model to fine-tune on.

In [3]:
SO_MONGO_QUERY = """
SELECT
    CONCAT(q.title, q.body) as question,
    a.body AS answer
FROM
    `bigquery-public-data.stackoverflow.posts_questions` q
JOIN
    `bigquery-public-data.stackoverflow.posts_answers` a
ON
    q.accepted_answer_id = a.id
WHERE
    q.accepted_answer_id IS NOT NULL AND -- Keep only questions with an accepted answer
    REGEXP_CONTAINS(q.tags, "mongo")
"""

In [4]:
stack_overflow_df = bigquery.Client().query(SO_MONGO_QUERY).result().to_arrow().to_pandas()
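
The `REGEXP_CONTAINS(q.tags, "mongo")` filter in the query matches any row whose tags contain the substring `mongo`, so it picks up related tags such as `mongoose` or `pymongo` as well as `mongodb`. The sketch below mirrors that behavior locally with Python's `re.search`. The sample tag strings are hypothetical, and the pipe-delimited layout of the `tags` column is an assumption rather than something shown in this notebook.

```python
import re

# Hypothetical sample values; the pipe-delimited tag layout is an assumption.
sample_tags = [
    "mongodb|node.js",
    "mongoose|express",
    "python|pymongo",
    "mysql|php",
]

pattern = re.compile("mongo")

# Like REGEXP_CONTAINS, re.search counts a match anywhere in the string.
matches = [tags for tags in sample_tags if pattern.search(tags)]
print(matches)
```

Only the `mysql|php` row is filtered out, which suggests the query casts a deliberately wide net over MongoDB-related questions.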

In [5]:
print(f'The Query Returned {stack_overflow_df.shape[0]} rows')

The Query Returned 87090 rows


#### Data Preparation

As we see above, our query returned over 80,000 rows of question/answer pairs. However, in order to begin tuning, we'll need to transform this into something a model can interpret.

For this we need not a tabular dataset but a single text string combining the question and answer that we can feed into the model.

The function below will merge both the question and answer columns in a format that the model can learn from.  
<br />
  
```python
### Question: {Question Column} 

### Answer: {Answer Column}
```

In [6]:
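def template_formatter(data):
    # Build one "### Question / ### Answer" string per row.
    # zip pairs the two columns by position, so this works for any DataFrame index.
    output = []
    for question, answer in zip(data['question'], data['answer']):
        text = f"### Question: {question}\n ### Answer: {answer}"
        output.append(text)
    return output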
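
Before running the formatter on all of the rows, it can help to try the template on a tiny input and confirm that each entry carries both markers. The following is a sketch on hypothetical toy data. `format_qa_pairs` is an illustrative stand-in that produces the same string layout, not the notebook's own function.

```python
def format_qa_pairs(questions, answers):
    """Render each question/answer pair in the '### Question / ### Answer' template."""
    return [
        f"### Question: {q}\n ### Answer: {a}"
        for q, a in zip(questions, answers)
    ]


# Hypothetical toy data standing in for the query results.
toy = {
    "question": ["How do I list databases?", "How do I count documents?"],
    "answer": ["Run show dbs.", "Use countDocuments()."],
}

formatted = format_qa_pairs(toy["question"], toy["answer"])

# Sanity check: every entry starts with the question marker and
# contains the answer marker exactly once.
for entry in formatted:
    assert entry.startswith("### Question: ")
    assert entry.count("\n ### Answer: ") == 1

print(formatted[0])
```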

Next we'll use the function above to create a list with each entry in the Q/A template format.

In [7]:
template_formatted_data = template_formatter(stack_overflow_df)
template_formatted_data[:5]

["### Question: Mongoose findById is not returning all fields<p>I'm calling findById using mongoose and it's not returning all fields, or at least it's not mapping to a field correctly. But it returns that field if I use aggregate</p>\n<p>I have the following schema</p>\n<pre><code>const ratingSchema = new mongoose.Schema({\n    rating: {\n        type: Number,\n        default: 0,\n        min: 0,\n        max: 5\n    }\n})\n\nconst locationSchema = new mongoose.Schema({ \n    name: {\n        type: String,\n        required: true,\n    },\n    address: {\n        type: String,\n        required: true,\n    },\n    rating: ratingSchema,\n    \n    facilities: [String],\n});\n\nlocationSchema.index({coords: '2dsphere'});\n\nmongoose.model('Location', locationSchema);\n</code></pre>\n<p>When I call</p>\n<pre><code>const Loc = mongoose.model('Location');\n\nconst locationsReadOne = (req, res) =&gt; {\n    Loc\n        .findById(req.params.locationid)\n        .exec((err, location) =&gt; 

Next we'll write this list to a CSV with a `text` header. This header will be important for the later workflow.

First, let's create a DataFrame so we can take a peek at what our new corpus looks like.

In [19]:
template_formatted_df = pd.DataFrame(template_formatted_data, columns=["text"])
template_formatted_df.head()

Unnamed: 0,text
0,"### Question: Mongoose findById is not returning all fields<p>I'm calling findById using mongoose and it's not returning all fields, or at least it's not mapping to a field correctly. But it returns that field if I use aggregate</p>\n<p>I have the following schema</p>\n<pre><code>const ratingSchema = new mongoose.Schema({\n rating: {\n type: Number,\n default: 0,\n min: 0,\n max: 5\n }\n})\n\nconst locationSchema = new mongoose.Schema({ \n name: {\n ..."
1,### Question: Why is my mongo collection being wiped on azure ubuntu instance?<p>I'm using azure ubuntu instance to store some data every minute in a mongo database. I noticed that the data is being wiped approximately once a day. I'm wondering why my data is being wiped?</p>\n<p>I have a log every minute that shows a count of the db. Here are two consecutive minutes that show all records are deleted</p>\n<pre><code>**************************************\nupdate at utc: 2022-08-06 10:19:02.3...
2,"### Question: MongoDb score results based on simple matches<p>I'm trying to create a simple search algorithm that will try to match against a first name, last name, and/or set of tags, as an example:</p>\n<pre><code>[\n {\n &quot;key&quot;: 1,\n &quot;fname&quot;: &quot;Bob&quot;,\n &quot;lname&quot;: &quot;Smith&quot;,\n &quot;tags&quot;: [\n &quot;a&quot;,\n &quot;b&quot;,\n &quot;c&quot;\n ]\n },\n {\n &quot;key&quot;: 2,\n &quot;fname&quot;: &quot;J..."
3,"### Question: Laravel 5.7 mongodb atlas connection problem using jenssegers/mongodb<p>I want to connect my laravel 5.7 application(I used the 3.4 version of jenssegers/mongodb) with a mongodb in atlas, I tried in localhost(I isntalled the mongo extension) and everything is ok but with atlas i got an error message:</p>\n<blockquote>\n<p>Failed to parse MongoDB URI:\n'mongodb://root%3Acluster0.xxx.mongodb.net%3A27017%2Fhddatabase%3FretryWrites%3Dtrue%26w%3Dmajority'.\nInvalid host string in U..."
4,### Question: Remote Mongo DB connection through shell scripts<p>I have to establish connection to Remote mongo database through shell script and execute a js file with mongo queries Please help me with the commands..</p>\n ### Answer: <p>In general:</p>\n<pre><code> mongo mongodb://user:&lt;pass&gt;@&lt;ip&gt;:27017/test&amp;authSource=admin myjsfile.js\n</code></pre>


Finally, we'll save that DataFrame as a CSV file to be used later.

In [11]:
template_formatted_df.to_csv("./data/csv_template_formatted.csv", index=False)
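
Two details are worth checking when saving. First, the `./data` directory has to exist, since pandas will typically raise an error rather than create it. Second, the text contains commas, quotes, and newlines, so it relies on CSV quoting to survive a round trip. The sketch below illustrates both with the standard library only. The rows are hypothetical, and a temporary directory stands in for `./data`.

```python
import csv
import os
import tempfile

# Hypothetical rows containing a comma, quotes, and a newline, like the real corpus.
rows = [
    '### Question: Why "wiped", again?\n ### Answer: <p>Check the logs, then retry.</p>',
    "### Question: Second\n ### Answer: Plain",
]

# A temporary directory stands in for ./data; with exist_ok=True,
# makedirs does nothing if the directory is already there.
out_dir = os.path.join(tempfile.mkdtemp(), "data")
os.makedirs(out_dir, exist_ok=True)
out_path = os.path.join(out_dir, "csv_template_formatted.csv")

with open(out_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text"])  # single header column, as in the notebook
    writer.writerows([row] for row in rows)

# Read the file back and confirm the multi-line text survived intact.
with open(out_path, newline="", encoding="utf-8") as f:
    read_back = [record["text"] for record in csv.DictReader(f)]

assert read_back == rows
```

In the notebook itself, calling `os.makedirs("./data", exist_ok=True)` before `to_csv` would cover the first point.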

Now that we have our data formatted, we can move on to bringing a model into our local environment and beginning the fine-tuning process.