In [3]:
import pandas as pd

# Reading Data

Very often you will need to read data from a file. In most cases, this will be a CSV file. A CSV file can be read and placed into a Pandas `DataFrame`. The first row of the file is usually used as a header. Pandas will automatically create column names based on the header, but you can also provide your own column names. You can also read data from other sources such as an SQL database, an Excel file, a JSON file, etc. For more information on reading data, see the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).

## Reading CSV files and working with a dataframe
### CSV File format
Data in Data Science is often stored in files with the extension *.csv*. CSV stands for **Comma-Separated Values**: the values on each line are separated by commas. The format dates back to the 1970s. The name has stuck, but nowadays values can also be separated by other characters, such as a tab or a semicolon (;). That character is called the delimiter or the separator. In CSV files, strings are often placed between quotation marks, especially when they contain spaces or the separator itself, so that it is clear where the string begins and ends.
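As a minimal sketch of how the separator and quotation marks interact, the example below reads a small semicolon-separated text from memory. The contents of `raw` and the column names are hypothetical and are not taken from any of the course datasets.

```python
import io
import pandas as pd

# Hypothetical file contents: semicolon-separated, with one quoted string
# that itself contains the separator.
raw = 'name;city\n"Peeters; Jan";Antwerp\nAn;Ghent\n'

# sep tells pandas which character separates the values.
people = pd.read_csv(io.StringIO(raw), sep=';')
print(people)
```

Because the first name is quoted, the semicolon inside it is treated as part of the string and not as a separator.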

Decimal numbers can be stored in two ways: with a decimal point or with a decimal comma. If you save a file in a Dutch-language version of Excel, a decimal comma is used automatically. An English-language version uses a decimal point. For this reason, we recommend that you do not work with Excel in this course.
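The effect of the decimal notation can be illustrated with a small sketch. The data below is made up for this example. It is read twice: once with the default settings and once with `decimal=','`.

```python
import io
import pandas as pd

# Hypothetical file contents that use a decimal comma.
raw = 'product;price\nlaptop;799,99\nmouse;19,5\n'

# With the default decimal point, "799,99" is not recognised as a number.
as_text = pd.read_csv(io.StringIO(raw), sep=';')

# With decimal=',' the column is parsed as floating-point numbers.
as_float = pd.read_csv(io.StringIO(raw), sep=';', decimal=',')

print(as_text['price'].dtype)
print(as_float['price'].dtype)
```

If a numeric column unexpectedly ends up as text after reading, a wrong decimal setting is one of the first things to check.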

Before reading a *.csv* file, always check in a text editor how it is stored. Which character separates the values? Which decimal notation is used? You can also do this in PyCharm, which can additionally display a CSV file as a table.

Go to the `data` folder and open a CSV file in PyCharm. You will notice that there are tabs at the bottom that allow you to choose how you view the file. In PyCharm, you can view a CSV file as a table.

### Reading a CSV file

```python
data = pd.read_csv('datasets/persons1.csv')
```

__READING A CSV FILE IS ONE OF THE MOST COMMON TASKS IN DATA SCIENCE. YOU ARE EXPECTED TO BE ABLE TO DO THIS FLUENTLY DURING EVALUATIONS. MANY STUDENTS LOSE UNNECESSARY TIME OR ARE NOT ABLE TO SOLVE THE QUESTIONS BECAUSE THEY ARE NOT FAMILIAR WITH THIS TASK.__

Go to the `data` directory and open `BicycleWeather.csv` in PyCharm.

a. What is the separator in this file?\
b. Read the dataset into a DataFrame using the `read_csv` function. Use the `sep` parameter to specify the separator.


In PyCharm you can view the DataFrame in a table format. Go to the `data` variable in the Jupyter tab and click on "View as DataFrame".
You can of course also print the data in a cell.

In [5]:
print(data)

Now investigate the data with the `describe()`, `info()` and `head()` methods.

Select the columns `Station_name`, `date` and `TMAX` from the rows with index 10 to 20.

Now read the file again, but this time use the extra parameter `names` to specify your own column names. Use `range(0, 10)` as column names.

Look at the first three rows of the data. What problem do you see?

Try to solve the problem by using the `header` parameter.

## Categorical Variables
We start by creating a Pandas `Categorical`. A `Categorical` is a list of values that all come from a fixed set of categories. A categorical variable can take on only a fixed number of values, which are usually expressed as strings. It represents a nominal or an ordinal variable, depending on whether or not the values have a certain order.


Take your **blood type** as an example. Possible values are:

``> O-, O+, B-, B+, A-, A+, AB-, AB+``

From the introduction, we know that blood type is a **nominal variable**.
You cannot perform calculations with these values. There are also examples where there is an order, for example:

Take the degree of agreement as an example. Possible values could be:\
``> none, little, more, most``\
This is clearly an example of an **ordinal variable**.
Sometimes the values of a categorical variable are represented by numbers, but they are still categorical variables. Do not be misled by this.

From the theory of measurement scales, we know that nominal variables can only be compared (using the = operator) and ordinal variables can at most be sorted (using the <, > and = operators). To work efficiently with these values, each category is assigned an integer index. Because this index is a number, it is much faster to find the values of a given category in a large dataset.

Run the following code to create a nominal variable `bloodtype`.
```python
values = ['AB-', 'O-', 'B-', 'B-', 'A+', 'AB+', 'O+', 'B-', 'B+', 'A-', 'A+', 'AB-']
bloodtype = pd.Categorical(values, categories=['O-','O+','B-','B+','A-','A+','AB-','AB+'])
bloodtype
```
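The integer index behind each value can be inspected through the `codes` attribute. The sketch below uses a shortened, separate variable `bloodtype_demo` (a name introduced only for this illustration) so that it runs on its own.

```python
import pandas as pd

values = ['AB-', 'O-', 'B-', 'B-']
bloodtype_demo = pd.Categorical(
    values, categories=['O-', 'O+', 'B-', 'B+', 'A-', 'A+', 'AB-', 'AB+']
)

# Each value is stored as the position of its category in the categories list.
print(bloodtype_demo.codes)       # [6 0 2 2]
print(bloodtype_demo.categories)  # the eight possible blood types
print(bloodtype_demo.ordered)     # False: a nominal variable has no order
```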

Now run the following code to create an ordinal variable `agreement`.
```python
values = ['little', 'more', 'none', 'more', 'little', 'most', 'none']
agreement = pd.Categorical(values, categories=['none', 'little', 'more', 'most'], ordered=True)
agreement
```
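Because the categories are ordered, sorting and the comparison operators become meaningful. The sketch below repeats the definition under the separate name `agreement_demo` (introduced only for this illustration) and shows a few operations that rely on the order.

```python
import pandas as pd

values = ['little', 'more', 'none', 'more', 'little', 'most', 'none']
agreement_demo = pd.Categorical(
    values, categories=['none', 'little', 'more', 'most'], ordered=True
)

# Smallest and largest value according to the category order.
print(agreement_demo.min(), agreement_demo.max())

# Element-wise comparison against one of the categories.
print(agreement_demo > 'little')

# Sorting follows the category order, not the alphabetical order.
print(agreement_demo.sort_values())
```

The same comparison on an unordered categorical such as the blood types would raise an error, which matches the theory: nominal values cannot be ranked.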

When reading a CSV file, you can also specify which columns are categorical. This can be done with the `dtype` parameter. Create a new DataFrame `laptops` from the file `laptops.csv`. Specify that the columns `cpuGeneration` and `brand` are categorical.\
_Tip: `dtype={'col_name': 'category', 'col_name2': 'category'}`_\
As always, first look at the file to see what the separator is. In this case it will also be important to set the `decimal` argument correctly.

Check the result. `cpuGeneration` and `brand` should be of type `category` and `diskspace` should be of type `float`.
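To see how `sep`, `decimal` and `dtype` work together, here is a minimal sketch on made-up data. The column names `colour`, `size` and `weight` are hypothetical and unrelated to `laptops.csv`.

```python
import io
import pandas as pd

# Hypothetical file contents: semicolon-separated with decimal commas.
raw = 'colour;size;weight\nred;M;1,5\nblue;L;2,25\nred;S;0,75\n'

items = pd.read_csv(
    io.StringIO(raw),
    sep=';',
    decimal=',',
    dtype={'colour': 'category'},  # read this column as a categorical
)

# dtypes lists the type of every column.
print(items.dtypes)
```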

In many cases you will want to convert a column to a categorical variable after reading the file into a DataFrame. For the DataFrame `laptops`, convert `cpuType` to a categorical. `cpuType` has to be ordinal and the categories should be 'i3', 'i5' and 'i7', in that order.

Check whether the column has been converted correctly. Make sure you see `<` between the categories.
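One possible way to convert an existing column is `astype` with a `pd.CategoricalDtype`. The sketch below shows this on a made-up DataFrame `shirts` with a hypothetical column `size`. It is an illustration of the technique, not the only way to do it.

```python
import pandas as pd

# Hypothetical data, unrelated to the laptops dataset.
shirts = pd.DataFrame({'size': ['M', 'L', 'S', 'M']})

# Describe the categories and their order, then convert the column.
size_type = pd.CategoricalDtype(categories=['S', 'M', 'L'], ordered=True)
shirts['size'] = shirts['size'].astype(size_type)

# For an ordered categorical, the printed categories are joined by "<".
print(shirts['size'])
```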