# Introduction to Data, Datasets, and Basic Python

### What is Data?

Data are individual units of information. In analysis, data are represented by variables. Examples include the names of everyone in a community, the height of each player on a basketball team, the life expectancy in several different countries, a description of a patient's symptoms over the course of a hospital stay, or the text of several articles about a specific topic. The possibilities are endless, and as more ways of collecting information become available, traditional data is growing into "big data".

### Structure of a Dataset

![alt text](dataset.png "Anatomy of a Dataset")

### What is a Variable? What is a value/observation?

<b>A variable</b> is a named container for one or more values. Whatever the value is, the variable holds it so that you can refer to it by name whenever you need it.

<b>A value</b> is information or data. This can include a number, a sentence, a symbol, an address, etc. The list is endless. Sometimes values are complex and messy, and therefore they are better organized and manipulated when assigned to a variable. 

We will cover <b>three</b> main types of values in this class: int, float, and string.

- An int value is an integer (a whole number)
- A float value is a decimal number
- A string value is a sequence of characters (letters, digits, symbols) that is treated as text rather than as a number; it is written in single or double quotes
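
To make the three types concrete, here is a minimal sketch. The variable names (`age`, `height`, `name`) are made up for illustration and are not part of the class dataset.

```python
# Three illustrative values (hypothetical names, not from the dataset)
age = 30          # int: a whole number
height = 1.75     # float: a decimal number
name = 'Ada'      # str: text; single or double quotes both work

# Numbers support arithmetic; int + float gives a float
total = age + height

# A string that looks like a number is still text, so + joins it
joined = "5" + "3"

print(type(age), type(height), type(name))
print(total)
print(joined)
```

Note that `"5" + "3"` produces the text `53`, not the number 8, which is why the type of a value matters before you compute with it.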

### Basic Python operations with variables 

In [None]:
# assigning a value to a variable 
# variable name = data 

x = 5
y = 3.5
z = "Hello"

In [None]:
# print function
# print out the values of variables | print(variable name)

print(x)
print(y)
print(z)

In [None]:
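# type function
# determine the data type | type(variable name)
# note: a notebook cell displays only the value of its last expression, so only type(z) is shown here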

type(x)
type(y)
type(z)

In [None]:
# determine the data type of all variables 

print(type(x))
print(type(y))
print(type(z))

In [None]:
# change a variable to a different type

x2 = str(x)

print(x2)
print(type(x2))
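
Conversion also works in the other directions. The sketch below uses made-up values to show `int()`, `float()`, and `str()`, and what happens when text cannot be read as a number.

```python
# Hypothetical values for illustration
count_text = "42"

count = int(count_text)    # str -> int
price = float("3.5")       # str -> float
whole = int(3.9)           # float -> int drops the decimal part (no rounding)
label = str(3.5)           # float -> str

# Text that is not a number cannot be converted and raises a ValueError
try:
    int("hello")
    converted = True
except ValueError:
    converted = False

print(count, price, whole, label, converted)
```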

In [None]:
# lists = a container for multiple values
# list of values, each value within the list can be called and manipulated (more on that later)

colors = ['red','yellow','blue']
mixed = ["green",17,"hippo"]

print(colors)
print(mixed)
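
As a small preview of calling values out of a list, the sketch below repeats the `colors` list and picks items by position. Positions start at 0, which comes up again with `iloc` later on.

```python
colors = ['red', 'yellow', 'blue']

first = colors[0]        # positions start at 0
last = colors[-1]        # negative positions count from the end
first_two = colors[0:2]  # slice: positions 0 and 1, the stop position is excluded

colors.append('green')   # lists can be changed after they are created

print(first, last, first_two)
print(colors)
```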

# Importing data into Python environment

#### What is a library?

A library is a collection of pre-written code that provides functions not included in basic Python. Once a library has been imported, its functions can be used throughout the rest of the notebook.
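
As a quick sketch of the idea, the example below imports two libraries that ship with Python, `math` and `statistics`. They are used here only because they need no installation; the alias `st` is an arbitrary choice, in the same spirit as the common `pd` alias for pandas.

```python
import math                # import a library under its own name
import statistics as st    # import a library under a shorter alias

# Functions are called as library_name.function_name(...)
root = math.sqrt(16)
average = st.mean([1, 2, 3, 4])

print(root)
print(average)
```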

In [None]:
# importing pandas library

import pandas as pd

In [None]:
# importing dataset

df = pd.read_csv("pokemon.csv")

# Checking your imported data

In [None]:
# checking dataset with head - view the first 5 rows

df.head()

In [None]:
# checking dataset with tail - view the last 5 rows 

df.tail()

# Data attributes 

In [None]:
# looking at the information about your dataset with .info()

df.info()

In [None]:
# looking at summary statistics with .describe()

df.describe()
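
To see what these summaries contain without needing the CSV file, here is a minimal sketch on a tiny hand-made table. The table `toy` and its values are invented for illustration and are not taken from `pokemon.csv`.

```python
import pandas as pd

# Tiny stand-in table (illustrative values only)
toy = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Squirtle"],
    "HP": [40, 50, 60],
})

n_rows, n_cols = toy.shape   # (number of rows, number of columns)
summary = toy.describe()     # by default, summarizes numeric columns only
hp_mean = summary.loc["mean", "HP"]

print(n_rows, n_cols)
print(summary)
```

Because `Name` holds text, it does not appear in the default `describe()` output; only `HP` is summarized.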

# Isolating data from your dataset 

### What is indexing? What is slicing?


Indexing data is a method that you can use to select specific rows or columns from your dataset. What if you just want to look at a few columns from a huge dataset? Or if you just want to look at a certain subset of rows?

You can index certain items to get a section of the data you want. Additionally, slicing data gives you a chunk of data from one value up to another value - ultimately, you're taking a slice of the larger dataset. 

In [None]:
df.head(1)

In [None]:
# selecting columns

df.Name

In [None]:
# selecting columns - another way! (double brackets return a DataFrame rather than a Series)

df[["Name"]]

In [None]:
# selecting multiple columns

df[["Name", "Type 1", "Stage"]]
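
The column selections above return different kinds of objects, which matters once you start chaining operations. Here is a minimal sketch on a hand-made table (`toy` and its values are illustrative, not read from `pokemon.csv`):

```python
import pandas as pd

# Small stand-in table (illustrative values only)
toy = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Squirtle"],
    "Type 1": ["Grass", "Fire", "Water"],
    "Stage": [1, 1, 1],
})

as_series = toy["Name"]              # single brackets: one column as a Series
as_frame = toy[["Name"]]             # double brackets: a one-column DataFrame
two_cols = toy[["Name", "Type 1"]]   # a list of labels selects several columns

# Dot access (toy.Name) only works for labels that are valid Python names,
# so a label with a space such as "Type 1" needs brackets
type_col = toy["Type 1"]

print(type(as_series))
print(type(as_frame))
```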

### iloc operation: position-based indexing 

Indexing in Python is zero-based, meaning numbering starts at 0. When you reference an object by its position, always be aware of how the numbering works. For example, the first column ("Num") is in position 0, the second column ("Name") is in position 1, and so on.

When you use <b>iloc</b>, you are selecting rows and columns by their integer position.
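
Here is a minimal sketch of position-based selection on a hand-made table, so the positions are easy to count by eye. The table `toy` and its values are illustrative and are not read from `pokemon.csv`.

```python
import pandas as pd

# Small stand-in table (illustrative values only)
toy = pd.DataFrame({
    "Num": [1, 2, 3, 4],
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur", "Charmander"],
    "Type 1": ["Grass", "Grass", "Grass", "Fire"],
})

first_row = toy.iloc[0]             # row at position 0, returned as a Series
first_two = toy.iloc[0:2]           # positions 0 and 1; the stop position 2 is excluded
one_cell = toy.iloc[0, 1]           # row 0, column 1 (the "Name" column)
last_two = toy.iloc[-2:]            # negative positions count from the end
picked = toy.iloc[[0, 3], [1, 2]]   # rows 0 and 3, columns 1 and 2

print(one_cell)
print(picked)
```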

In [None]:
df.head(2)

In [None]:
# selecting rows with iloc

df.iloc[0]

In [None]:
# selecting multiple rows with iloc
# inclusive : exclusive (up to but not including the last number)

df.iloc[0:5]

In [None]:
# selecting non-consecutive rows with iloc

df.iloc[[0,17,48]]

In [None]:
# selecting a row counted from the end of your dataset (-10 = tenth row from the end)

df.iloc[[-10]]

In [None]:
# selecting columns with iloc
# df.iloc[row index, column index]

df.iloc[0,1]

In [None]:
# selecting multiple columns with iloc

df.iloc[0,0:4]

In [None]:
# selecting non-consecutive columns with iloc

df.iloc[0,[1,2,11,12]]

In [None]:
# selecting all rows for specific columns 

df.iloc[:,[1,11,12]]

In [None]:
# selecting multiple rows and multiple columns

df.iloc[[0,17,48],[0,1,4,11,12]]

### loc operation: label-based indexing 

When you use <b>loc</b>, instead of using integer positions, you reference row and column labels to identify the sections of data that you want. There are some limitations to loc that will be made evident below.
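
One difference worth seeing side by side is how slices behave: a `loc` slice includes its end label, while an `iloc` slice excludes its stop position. The sketch below uses a hand-made table indexed by name; `toy` and its values are illustrative only.

```python
import pandas as pd

# Small stand-in table with names as row labels (illustrative values only)
toy = pd.DataFrame(
    {"Type 1": ["Grass", "Fire", "Water", "Electric"], "Stage": [1, 1, 1, 1]},
    index=["Bulbasaur", "Charmander", "Squirtle", "Pikachu"],
)

one_row = toy.loc["Pikachu"]                    # a single label returns a Series
by_label = toy.loc["Bulbasaur":"Squirtle"]      # label slice: both ends included
by_position = toy.iloc[0:2]                     # position slice: stop excluded
one_value = toy.loc["Squirtle", "Type 1"]       # row label, column label

print(len(by_label), len(by_position))
print(one_value)
```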

In [None]:
# re-import Pokemon dataset
# replacing the default index when importing the data

poke = pd.read_csv("pokemon.csv", index_col = "Name")

poke.head(25)

In [None]:
# selecting rows with loc - unique value

poke.loc["Pikachu"]

In [None]:
# selecting rows with loc - nonunique value
# adding iloc to get a specific subset 

In [None]:
# selecting multiple rows with loc

poke.loc[["Metapod","Weedle","Charmander","Mew"]]

In [None]:
# selecting rows and columns with loc

poke.loc[["Metapod","Weedle","Charmander","Mew"],"Type 1"]

In [None]:
# selecting rows and multiple columns with loc

poke.loc[["Metapod","Weedle","Charmander","Mew"],["Type 1","Stage","Legendary"]]

In [None]:
# slicing data with loc
## inclusive : inclusive 

poke.loc["Bulbasaur":"Squirtle"]

In [None]:
poke.loc["Bulbasaur":"Squirtle",["Type 1","Stage"]]

In [None]:
# slicing with nonunique values

poke2 = pd.read_csv('pokemon_dupe.csv', index_col="Name")
poke2.head(20)

In [None]:
poke2.loc["Weedle"]

In [None]:
poke2.loc["Weedle":]
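
Non-unique labels change what `loc` returns. The sketch below builds a small table with a repeated label to illustrate this; it is an assumption that `pokemon_dupe.csv` has a similar layout, and the behavior shown is that of recent pandas versions.

```python
import pandas as pd

# Stand-in table with a repeated row label (illustrative values only)
dupes = pd.DataFrame(
    {"Type 1": ["Bug", "Grass", "Bug", "Fire"]},
    index=["Weedle", "Bulbasaur", "Weedle", "Charmander"],
)

matches = dupes.loc["Weedle"]    # a repeated label returns every matching row
first_match = matches.iloc[0]    # add iloc to pick one of them by position

# Slicing from a repeated label that is scattered through the index is
# ambiguous, so pandas raises a KeyError instead of guessing
try:
    dupes.loc["Weedle":]
    slice_worked = True
except KeyError:
    slice_worked = False

print(matches)
print(slice_worked)
```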

In [None]:
# selecting with row or column labels that do not exist (raises a KeyError in recent pandas versions)

poke.loc[["Bulbasaur","Squirtle","Donald Duck"],["Type 1","Speed","Type 100"]]
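
If you want missing labels to come back as empty (NaN) entries rather than an error, one option is `reindex`. This is a sketch of an alternative, not part of the original lesson; the table `toy` is made up, and `reindex` requires unique labels.

```python
import pandas as pd

# Small stand-in table (illustrative values only)
toy = pd.DataFrame(
    {"Type 1": ["Grass", "Water"], "Speed": [45, 43]},
    index=["Bulbasaur", "Squirtle"],
)

# In recent pandas versions, loc raises a KeyError when a listed label is missing
try:
    toy.loc[["Bulbasaur", "Donald Duck"], ["Type 1", "Type 100"]]
    raised = False
except KeyError:
    raised = True

# reindex keeps the requested labels and fills the missing ones with NaN
padded = toy.reindex(
    index=["Bulbasaur", "Donald Duck"],
    columns=["Type 1", "Type 100"],
)

print(raised)
print(padded)
```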

### Creating subsets of original dataset

In [None]:
df.head()

In [None]:
# assigning selected columns to new variable

new = df[["Name", "Type 1", "Type 2", "Stage", "Legendary"]]

new.head()

In [None]:
# assigning sliced rows and columns to new variable 

new = df.iloc[0:25, [1,2,3,11,12]]

new

# Exporting data outside of Python environment 

In [None]:
# exporting data from Python environment to .csv file 

new.to_csv('new_pokemon.csv',index=False)

In [None]:
# check that file was exported

isNew = pd.read_csv("new_pokemon.csv")

isNew
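
To see why `index=False` is used when exporting, the sketch below writes a small made-up table to CSV text both ways and reads it back. It uses `io.StringIO` in place of a file on disk so it can run anywhere; `toy` and its values are illustrative.

```python
import io
import pandas as pd

# Small stand-in table (illustrative values only)
toy = pd.DataFrame({"Name": ["Bulbasaur", "Squirtle"], "Stage": [1, 1]})

# to_csv returns the CSV text as a string when no file path is given
without_index = toy.to_csv(index=False)
with_index = toy.to_csv()

# Read both versions back, treating the text like a file
back_clean = pd.read_csv(io.StringIO(without_index))
back_extra = pd.read_csv(io.StringIO(with_index))

print(list(back_clean.columns))
print(list(back_extra.columns))
```

Without `index=False`, the row numbers are written as an extra unnamed column, which then shows up as `Unnamed: 0` when the file is read back.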