# <font color='SEAGREEN'>Day 1 (Part 2)</font>
# <font color='MEDIUMSEAGREEN'>Loading the Data</font>
As we already know, data is an important part of our machine learning program.

Today we will learn how to load the data into Python.

## Dataset
To download the data go to https://www.kaggle.com/mdepak/fakenewsnet/version/1 and download the following files:

    - BuzzFeed_fake_news_content.csv
    - BuzzFeed_real_news_content.csv

These files contain the fake news data gathered by researchers from the DMML lab at ASU, stored in the "csv" format. "csv" stands for "comma-separated values". We'll use this information later, when we tell the program how to load this data.
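
To make the "comma-separated values" idea concrete, here is a minimal sketch that parses a tiny CSV held in a string. The column names and rows are made up for illustration and are not the columns of the BuzzFeed files.

```python
import io

import pandas as pd

# Hypothetical two-row CSV text: the first line holds the column names,
# and commas separate the values on every line.
csv_text = "title,source\nFirst headline,example.com\nSecond headline,example.org\n"

# pd.read_csv accepts a file path or a file-like object, so an in-memory
# buffer stands in for a file on disk here.
toy = pd.read_csv(io.StringIO(csv_text))

print(list(toy.columns))  # ['title', 'source']
print(len(toy))           # 2
```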

PRO TIP: Make sure that the downloaded datasets and this Jupyter notebook are in the same directory (folder); otherwise, the file names used later will not be found.

Open these two files (for example, with Microsoft Excel) and get a sense of what the data contains.
Try to answer the following questions:

    - What information do these .csv files contain?
    - Which columns do you think are important for classifying the articles as fake or real news?
    - How many news articles were collected?


In [None]:
# Write your answers in comments below (1 line, each):
# Answer to Q1.
# Answer to Q2.
# Answer to Q3.

Refer to the paper *"FakeNewsNet: A Data Repository with News Content, Social Context and Dynamic Information for Studying Fake News on Social Media"* and read the *"Dataset Integration"* part to understand how they extracted the content of these articles.

link to the paper --> https://arxiv.org/abs/1809.01286

    - How was the data collected?


In [None]:
# Write your answer in comments below (3-4 lines):

### Load the Data
To load the data, we first import the libraries we need: pandas for DataFrames and NumPy for arrays.

In [None]:
import pandas as pd
import numpy as np

Run the following code to load the data you downloaded into DataFrames.

In [None]:
fake_news_content = pd.read_csv("BuzzFeed_fake_news_content.csv", encoding="utf-8")
real_news_content = pd.read_csv("BuzzFeed_real_news_content.csv", encoding="utf-8")

In [None]:
fake_news_content.head()

What do you think the ``head()`` function does?

In [None]:
# Write down your answer in comments (one line):
#

In [None]:
fake_news_content.iloc[0]

What will ``.iloc[0]`` return?

In [None]:
# Write down your answer in comments (one line):
#

### Layout of the Data
The ``.columns`` attribute of a DataFrame tells us the names of the columns. Run the following cells to examine the column names of the DataFrames we just created.

In [None]:
fake_news_content.columns

In [None]:
real_news_content.columns

Now that we've looked at the columns of the dataset, let's look at the rows. How many rows are in each dataset? We can use the ``.shape`` attribute to tell us the number of rows in each dataset. Can you guess what the second number returned by ``.shape`` corresponds to?

In [None]:
fake_news_content.shape

In [None]:
real_news_content.shape
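
As a small aside, the sketch below uses a made-up DataFrame (not the BuzzFeed data) to show that ``.shape`` is an attribute, so it is read without parentheses, and that its first entry matches ``len()``. The second entry is left for you to work out.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the downloaded files.
toy = pd.DataFrame({"title": ["a", "b", "c"], "text": ["x", "y", "z"]})

n_rows = toy.shape[0]      # first entry of the shape tuple
print(n_rows == len(toy))  # True: both count the rows
```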

In [None]:
# Your answer:

### Re-organize the data
The data we loaded contains all the information we need, but it has the articles for fake and real news in different dataframes. 

To make the data easier to work with, we'd like to put the information about all the articles in one DataFrame.

    - If we use one dataframe for all, how can we differentiate between fake and real news? 
    

In [None]:
# Write down your answer in comments (one line):
#

In [None]:
frames = [fake_news_content, real_news_content]
data = pd.concat(frames)
# data = pd.merge(fake_news_content, real_news_content)

In [None]:
data.shape
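
One detail worth knowing about ``pd.concat``: by default it stacks the rows and keeps each frame's original index, so index values can repeat. The sketch below illustrates this on two made-up frames; passing ``ignore_index=True`` is one optional way to renumber the rows, not something the cells above require.

```python
import pandas as pd

# Hypothetical toy frames standing in for the two news DataFrames.
a = pd.DataFrame({"title": ["a1", "a2"]})
b = pd.DataFrame({"title": ["b1", "b2", "b3"]})

# Default behaviour: rows are stacked, original index labels are kept.
stacked = pd.concat([a, b])
print(list(stacked.index))     # [0, 1, 0, 1, 2]

# Optional: renumber the rows from 0 to n-1.
renumbered = pd.concat([a, b], ignore_index=True)
print(list(renumbered.index))  # [0, 1, 2, 3, 4]
```

Because the export step later writes the file with ``index=False``, the repeated index values do not end up in the saved CSV either way.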

Now we need to create the labels for our dataset.

    - What do you think the two lines of code below do?

PRO TIP: You can create a code cell below the code and use the ``.head()`` function to get a sense of the DataFrames.

In [None]:
# Write down your answer in comments (one line):
#

In [None]:
y_fake = pd.DataFrame(1, index=np.arange(len(fake_news_content)), columns=["label"])
y_real = pd.DataFrame(0, index=np.arange(len(real_news_content)), columns=["label"])

Concatenate the ``y_fake`` and ``y_real`` dataframes into one dataframe named ``y_data``. Verify the combined dataframe by checking its ``.shape`` attribute.

In [None]:
# your code goes here

### Exporting the data
Now that we have created an amalgamated dataset, we'd like to export this, so that we can use it in the future:

In [None]:
data.to_csv("data.csv", index=False, encoding="utf-8")
y_data.to_csv("labels.csv", index=False)
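
A quick way to check an export is to read the file back and compare it with the original. The sketch below does this with a made-up DataFrame and a temporary directory, so it does not touch ``data.csv`` or ``labels.csv``; the file name ``toy.csv`` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data used only to demonstrate the round trip.
toy = pd.DataFrame({"title": ["a", "b"], "label": [1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv")
    # index=False keeps the row index out of the file, as in the cell above.
    toy.to_csv(path, index=False, encoding="utf-8")
    restored = pd.read_csv(path, encoding="utf-8")

# Same columns and same number of rows as before the export.
print(list(restored.columns) == list(toy.columns))
print(restored.shape == toy.shape)
```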

### Learn More about Dataframes
Read more about dataframes in the pandas documentation and find out how to access only a specific column.

Create a new dataframe that only includes the text of the articles.

In [None]:
# Your code

## Learn More about Types of Data
In machine learning, data is often represented in a tabular format. Consider the example of predicting whether an individual who visits an online book seller is going to buy a specific book. This prediction can be performed by analyzing the individual’s interests and previous purchase history. For instance, when John has spent a lot of money on the site, has bought similar books, and visits the site frequently, it is likely that John will buy that specific book. John is an example of an instance. Instances are also called points, data points, or observations. A dataset consists of one or more instances:

<img src="images/tb1.png">

A dataset is represented using a set of features, and an instance is represented using values assigned to these features. Features are also known as *measurements* or *attributes*. In the above example, the features are Name, Money Spent, Bought Similar, and Visits; the feature values for the first instance are John, High, Yes, and Frequently. Given the feature values for one instance, one tries to predict its class (or class attribute) value. In our example, the class attribute is Will Buy, and our class value prediction for the first instance is Yes. An instance such as John, in which the class attribute value is unknown, is called an unlabeled instance. Similarly, a labeled instance is an instance in which the class attribute value is known. Mary in this dataset represents a labeled instance. The class attribute is optional in a dataset and is only necessary for prediction or classification purposes. One can have a dataset in which no class attribute is present, such as a list of customers and their characteristics.
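
The vocabulary above maps naturally onto a DataFrame: rows are instances, columns are features, and one column holds the class attribute. The sketch below is only an illustration. John's feature values follow the text, while Mary's feature values and class value are hypothetical, since the original table is an image.

```python
import pandas as pd

# Each row is an instance; "Will Buy" is the class attribute.
# John's class value is unknown (None); Mary's row is made up.
customers = pd.DataFrame(
    {
        "Name": ["John", "Mary"],
        "Money Spent": ["High", "Low"],
        "Bought Similar": ["Yes", "Yes"],
        "Visits": ["Frequently", "Rarely"],
        "Will Buy": [None, "No"],
    }
)

# The features are every column except the class attribute.
features = customers.drop(columns=["Will Buy"])

# Unlabeled instances have no class value; labeled instances do.
unlabeled = customers[customers["Will Buy"].isna()]
labeled = customers[customers["Will Buy"].notna()]

print(list(features.columns))
print(list(unlabeled["Name"]))  # ['John']
print(list(labeled["Name"]))    # ['Mary']
```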

There are different types of features based on the characteristics of the feature and the values they can take. For instance, Money Spent can be represented using numeric values, such as $25. In that case, we have a continuous feature, whereas in our example it is a discrete feature, which can take a number of ordered values: {High, Normal, Low}.

### Different Types of Data

1. **Nominal (categorical)**. These features take values that are often represented as strings. For instance, a customer’s name is a nominal feature. In general, only a few statistics can be computed on nominal features. Examples are the chi-square statistic ($\chi^2$) and the mode (most common feature value). For example, one can find the most common first name among customers. The only possible transformation on the data is comparison. For example, we can check whether our customer’s name is John or not. Nominal feature values are often presented in a set format.

2. **Ordinal**. Ordinal features lay data on an ordinal scale. In other words, the feature values have an intrinsic order to them. In our example, Money Spent is an ordinal feature because a High value for Money Spent is more than a Low one.

3. **Interval**. For interval features, in addition to their intrinsic ordering, differences are meaningful, whereas ratios are meaningless. Addition and subtraction are allowed, whereas multiplication and division are not. Consider two time readings: 6:16 PM and 3:08 PM. The difference between these two time readings is meaningful (3 hours and 8 minutes); however, the ratio $\frac{\text{6:16 PM}}{\text{3:08 PM}}$ has no meaning, and in particular it is not equal to 2.

4. **Ratio**. Ratio features, as the name suggests, add the properties of multiplication and division. An individual’s income is an example of a ratio feature, where not only differences and additions are meaningful but ratios also have meaning (e.g., an individual’s income can be twice as much as John’s income).
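
The four feature types can be sketched in pandas as follows. This is only an illustration under stated assumptions: the names, the date attached to the two clock times, and the income figures are made up, and an ordered ``Categorical`` is just one possible way to represent an ordinal feature.

```python
import pandas as pd

# Nominal: values can be compared for equality and counted (mode).
names = pd.Series(["John", "Mary", "John"])
print(names.mode()[0])            # John

# Ordinal: an ordered categorical knows that Low < Normal < High.
spent = pd.Series(
    pd.Categorical(
        ["High", "Low", "Normal"],
        categories=["Low", "Normal", "High"],
        ordered=True,
    )
)
print(spent.min(), spent.max())   # Low High

# Interval: differences are meaningful, ratios are not.
# The date is arbitrary; only the clock times come from the text.
t1 = pd.Timestamp("2024-01-01 18:16")
t2 = pd.Timestamp("2024-01-01 15:08")
print(t1 - t2)                    # 0 days 03:08:00
try:
    t1 / t2
except TypeError:
    print("dividing two clock times is not defined")

# Ratio: division is meaningful (hypothetical incomes).
john_income = 30000
other_income = 60000
ratio = other_income / john_income
print(ratio)                      # 2.0
```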

### Example
Study the table below and list the type of each feature. Explain your reasoning.

<img src="images/tb2.png">

In [None]:
# Your answer:
# 1. Outlook:
# 2. Temperature:
# 3. Humidity:
# 4. Windy:
# 5. Play:

In [None]:
print("Nice work today!")