In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("Assignment4_COMM187Spring2024.ipynb")

# Coding Assignment #4
## COMM187: Data Science in Communication Research, Spring 2024

**!!! Please make sure to run the first cell before running the autograder !!!**

<h3><span style="color:green">  Data Analysis of Emotional Tone in Text Messaging 💬 </span> </h3>

You are a data scientist working on a study to understand how people's emotional expressions in text messages change throughout the day. 

Your team has collected data in the form of text messages sent over different times of the day. Each message has been pre-classified with one of three emotional tones: <span style="color:green">"positive"</span>, <span style="color:blue">"neutral"</span>, or <span style="color:red">"negative"</span>.

For example, in the following hypothetical text messages between friends named Franco and Jamie, some messages are <span style="color:green">"positive"</span> in tone, some are <span style="color:blue">"neutral"</span> in tone, and some are <span style="color:red">"negative"</span>.

<img src="./imgs/textmessagesample.png" alt="drawing" width="500"/>

**Data:** For one pair of individuals in the study sample, the variables provided contain data about text messages exchanged between them over a period of about three months. The data is stored in these variables in the order in which the messages were sent. For each message, we have a message ID, the day of the month, the name of the month, the time of day it was sent, and the associated emotional tone:

`message_id` is a numpy array with the unique ID of each text message. This helps anonymize the data and protect the privacy of the study participants. \
`day` is a numpy array with the day of the month. \
`month` is a numpy array with the name of the month. \
`time` is a numpy array with the time of the day. \
`tone` is a numpy array with the emotional tone of the texts. 

Each of these arrays has a special relationship called a **"parallel structure."** This means that the data in each array at a particular index corresponds to the same single text message. For example, the first element in each array refers to the first message, the second element to the second message, and so on. This is important for analyzing the data accurately because it ensures you are looking at the right elements across different arrays that relate to the same text message.

For instance:\
`message_id[0]` contains the unique ID for the first text message. \
Similarly, `day[0]`, `month[0]`, `time[0]`, and `tone[0]` provide the day of the month, the name of the month, the time of the day, and the emotional tone of that same first message.
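To make the parallel structure concrete, here is a minimal sketch that uses small made-up arrays. The `toy_` names and their values are hypothetical stand-ins, not the study data. It shows that a single index picks out the same message in every array.

```python
import numpy

# Hypothetical toy arrays that mimic the parallel structure (not the real data).
toy_message_id = numpy.array(["m001", "m002", "m003"])
toy_day = numpy.array([28, 28, 29])
toy_month = numpy.array(["june", "june", "june"])
toy_time = numpy.array(["morning", "evening", "afternoon"])
toy_tone = numpy.array(["positive", "neutral", "negative"])

# Using the same index in every array describes one single message.
i = 1
summary = f"{toy_message_id[i]} on {toy_month[i]} {toy_day[i]} ({toy_time[i]}): {toy_tone[i]}"
print(summary)
```

Changing `i` to another valid index describes a different message, and all five values change together.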

***How to use this structure in your analysis:***
 - By using the `message_id` as a reference, you can ensure that you are consistently analyzing data from the same message when comparing across the `day`, `month`, `time`, and `tone` arrays.
 - This structured approach allows you to effectively maintain data integrity and privacy while performing data analysis.

Understanding this structure is crucial for performing correct analysis. As you work through the assignment, you will use indexing to navigate through these arrays. This will involve extracting and comparing data across `day`, `month`, `time`, and `tone` to analyze communication patterns and emotional tones based on the timing of the messages.

Run the following cell before moving on to the assignment questions.

In [None]:
import numpy
message_id = numpy.loadtxt("./data/assign04/message_id.txt", delimiter = ",", dtype = str)
day = numpy.loadtxt("./data/assign04/day.txt", delimiter = ",", dtype = int)
month = numpy.loadtxt("./data/assign04/month.txt", delimiter = ",", dtype = str)
time = numpy.loadtxt("./data/assign04/time.txt", delimiter = ",", dtype = str)
tone = numpy.loadtxt("./data/assign04/tone.txt", delimiter = ",", dtype = str)

**Question 1: Basic Description of Data**


**a.** Because of the parallel structure of the data, all the arrays mentioned above have the same length.

Store the length of any of the arrays provided in a new variable `data_length`.

In [None]:
### Write your code below (in place of ...)
data_length = ...

In [None]:
grader.check("q1a")

**b.** These arrays contain data from three months. Name the three months.

Store the names of the three months in a new `numpy` array called `unique_months`.

**Tip:** Use the [`numpy.unique()`](https://numpy.org/doc/stable/reference/generated/numpy.unique.html) function.
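As a rough illustration of what `numpy.unique()` does, the sketch below applies it to a small made-up array. The array `toy_tone` and its values are hypothetical. The function returns the distinct values in sorted order.

```python
import numpy

# Hypothetical toy array with repeated values (not the real data).
toy_tone = numpy.array(["positive", "neutral", "positive", "negative", "neutral"])

# numpy.unique returns each distinct value once, sorted.
unique_tones = numpy.unique(toy_tone)
print(unique_tones)
```

The result is itself a `numpy` array, so it can be stored in a variable and indexed like any other array.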

In [None]:
### Write your code below (in place of ...)
unique_months = ...

**c.**  Similarly, calculate the unique values of times of the day in the array `time`.

Store the unique values of time in a new variable `unique_times`.

In [None]:
### Write your code below (in place of ...)
unique_times = ...

**Question 2: Exploring the data with `if` and `else` statements**

Let us try out some data exploration using `if` and `else` statements.


**NOTE:** Even if the autograder passes your code, if you have NOT used `if` and `else` statements for this question, all points for this question will be deducted.
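As a reminder of the general pattern, here is a minimal sketch of an `if`/`else` statement that inspects one element of an array. The array `toy_time`, the variable `label`, and the strings are made up for illustration and are not part of the assignment.

```python
import numpy

# Hypothetical toy array (not the real data).
toy_time = numpy.array(["morning", "evening", "afternoon"])

# Look at the element at index 2 and choose a label based on its value.
if toy_time[2] == "morning":
    label = "Early message"
else:
    label = "Later message"

print(label)
```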

**a.** For the text message at *index* 603, 

 - if the tone of the text message is "positive", make a new variable called `message603` and assign it the string value "It is positive!"
 - if the tone of the text message is "negative", assign `message603` the string value "It is negative!" 

In [None]:
### Write your code below (in place of ...)
...

In [None]:
grader.check("q2a")

**b.** For the text message at *index* 1521:

 - if the time at which it is sent is "evening", and the month is "august", \
   make a new variable called `message1521` and assign it this string: `"This right here is an evening in August."`
 - else, make a new variable called `message1521` and assign it this string: `"Definitely not an evening in August."`
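When a condition depends on more than one array, the comparisons can be joined with `and`. The sketch below shows the idea on made-up arrays. The `toy_` names, the variable `note`, and the strings are hypothetical.

```python
import numpy

# Hypothetical toy arrays with parallel structure (not the real data).
toy_day = numpy.array([4, 5, 5])
toy_tone = numpy.array(["neutral", "negative", "positive"])

i = 2
# Both comparisons must be True for the first branch to run.
if toy_day[i] == 5 and toy_tone[i] == "positive":
    note = "A positive message on the 5th."
else:
    note = "Something else."

print(note)
```

More comparisons can be chained with additional `and` keywords in the same way.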

In [None]:
### Write your code below (in place of ...)
...

In [None]:
grader.check("q2b")

**c.** For the text message at *index* 1000:

 - if the month in which it was sent is July, the tone of the message is "positive", and the time of the message is "afternoon", \
   make a new variable called `message1000` and assign it this string: `"I got 999 messages but a sad one on a July afternoon ain't one."`
 - else, make a new variable called `message1000` and assign it this string: `"I got 999 messages."`

In [None]:
### Write your code below (in place of ...)
...

In [None]:
grader.check("q2c")

**Question 3: Basic Data Segmentation** 

Someone in your data science team has informed you that certain dates within the dataset hold a special significance. Specifically, your team is interested in the dates starting from June 28 and ending on August 5.

Your job is now to find this segment of data from the existing arrays.

![](./imgs/assign02_index_explanation.png)

Let us break this down into multiple steps:

**Step 1:** Since the data is stored in the provided arrays in the order in which the messages were sent, we just need to know the index of the first text sent on June 28 (`start_index`) and the index of the last text sent on August 5 (`stop_index`). The elements from `start_index` through `stop_index` contain all the data we need.

**a.** Find the starting index and store it in a new variable `start_index`.

Here are the steps:

1. For each index in the data, in sequence, check whether the value of `day` at that index is 28 and the value of `month` at that index is "june".
2. Use a `for` loop to go through the indices in sequence. Use a variable `i` to take each value in the range of all possible indices (which can be created with `range(data_length)`).
3. Within the `for` loop, use an `if` statement to check whether `day` and `month` at index `i` match the desired values.
4. The first time these values match, assign `start_index` the value of the loop variable, and then `break` out of the loop.
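The "first match, then `break`" pattern described in these steps can be sketched on made-up arrays. The `toy_` arrays, the variable `first_match`, and the `-1` "not found" value are hypothetical choices for this illustration. The loop stops at the first index where both conditions hold.

```python
import numpy

# Hypothetical toy arrays with parallel structure (not the real data).
toy_day = numpy.array([27, 27, 28, 28, 29])
toy_tone = numpy.array(["neutral", "positive", "negative", "positive", "positive"])

first_match = -1  # assumed placeholder meaning "no match found yet"
for i in range(len(toy_day)):
    # Check both arrays at the same index i.
    if toy_day[i] == 28 and toy_tone[i] == "positive":
        first_match = i
        break  # stop at the first match so later matches are ignored

print(first_match)
```

Without the `break`, the loop would keep going and `first_match` would end up holding the last matching index instead of the first.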

The skeleton for this exercise is provided below. You just need to fill in the value of `condition` and the value of `start_index`.

In [None]:
### Write your code below (in place of ...)
start_index = 0
for i in range(data_length):
    condition = ...
    if condition:
        start_index = ...
        break

In [None]:
grader.check("q3a")

**b.** Find the stopping index and store it in a new variable `stop_index`.

Remember, we are trying to find the LAST index at which the value of `day` is 5 and `month` is "august". \
Thus, an easy way to find the `stop_index` is to follow the *same steps* as for `start_index`, but for August 6, not August 5.\
The first index for August 6 is right after the last index for August 5. \
Thus, one index BEFORE the first index for August 6 would be our desired `stop_index`.
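The "one index before the next value starts" idea can be sketched on a made-up array. The names `toy_day`, `first_of_next`, and `last_of_five` are hypothetical. The sketch assumes the values are stored in order and that the next value (here, 6) actually appears in the array.

```python
import numpy

# Hypothetical toy array, stored in order (not the real data).
toy_day = numpy.array([4, 5, 5, 5, 6, 6])

# Find the first index where the NEXT value (6) appears.
first_of_next = 0
for i in range(len(toy_day)):
    if toy_day[i] == 6:
        first_of_next = i
        break

# The last 5 sits one position before the first 6.
last_of_five = first_of_next - 1
print(last_of_five)
```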

Use the code from Question 3a and find the value of `stop_index`. Copy the code from above and just replace values as needed for this question.

**NOTE:** If a `for` loop with an `if` statement is not used for the solution for this question, all points will be deducted.

In [None]:
### Write your code below (in place of ...)
...

In [None]:
grader.check("q3b")

**Step 2:** The final step! Now, you just need to segment the data and pick out the values that are within `start_index` and `stop_index`. 

REMEMBER! The values at both `start_index` AND `stop_index` need to be included. Use indexing appropriately!
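As a reminder of how slicing treats its endpoints, here is a minimal sketch on a made-up array. The names `toy_values`, `first`, `last`, and `segment` are hypothetical. A slice `a[first:last]` stops just before `last`, so one is added to keep the final element.

```python
import numpy

# Hypothetical toy array (not the real data).
toy_values = numpy.array([10, 20, 30, 40, 50, 60])

first, last = 1, 3

# Slicing excludes the stop position, so use last + 1 to include index 3.
segment = toy_values[first:last + 1]
print(segment)
```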

**c.** For each `numpy` array provided, make a new `numpy` array that includes only the segment of data from `start_index` to `stop_index`.

 - make a new variable `message_id_segmented` for the segmented data from `message_id`
 - make a new variable `day_segmented` for the segmented data from `day`
 - make a new variable `month_segmented` for the segmented data from `month`
 - make a new variable `time_segmented` for the segmented data from `time`
 - make a new variable `tone_segmented` for the segmented data from `tone`


In [None]:
### Write your code below (in place of ...)
message_id_segmented = ...
day_segmented = ...
month_segmented = ...
time_segmented = ...
tone_segmented = ...

In [None]:
grader.check("q3c")

**Question 4.** Alright! We now have the segmented data!

How many texts in the segmented data are positive in tone?

Save this value in a new variable `positives_in_segmented`.

**Hint:** We did a similar exercise for the previous assignment!
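Counting how often a value occurs can be done in more than one way. The sketch below shows two possible approaches on a made-up array and is not necessarily the method used in the previous assignment. The names `toy_time`, `evening_count`, and `evening_count_vectorized` are hypothetical.

```python
import numpy

# Hypothetical toy array (not the real data).
toy_time = numpy.array(["morning", "evening", "evening", "afternoon", "evening"])

# Approach 1: loop over the values and add one for every match.
evening_count = 0
for value in toy_time:
    if value == "evening":
        evening_count += 1

# Approach 2: compare the whole array at once; True counts as 1 when summed.
evening_count_vectorized = int(numpy.sum(toy_time == "evening"))

print(evening_count, evening_count_vectorized)
```

Both approaches give the same count; the second is shorter, while the first makes each step explicit.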

In [None]:
### Write your code below (in place of ...)
...

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

Assignment #4 for COMM187: Data Science in Communication Research. Once finished answering questions, first **SAVE** then download this .ipynb file. Submit the file as instructed on Canvas and Gradescope. **ONLY** submit the .ipynb file, not the zip file.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)