# Basic data scientist skills

### What does it mean for a data scientist to have 'substantive expertise', and why is it important?

- Knows which questions to ask
- Can interpret the data well
- Understands the structure of the data
- Data scientists often work in teams

All of these contribute to problem solving.

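The basic competencies of a data analyst are knowing how to ask questions in order to solve problems, being able to interpret data well, and understanding the structure of the data. (This also matters because data scientists mostly work in teams.)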

![The Data Science Venn Diagram](https://static1.squarespace.com/static/5150aec6e4b0e340ec52710a/t/51525c33e4b0b3e0d10f77ab/1364352052403/Data_Science_VD.png?format=300w)

Source: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram



## Intro to NumPy and pandas

#### NumPy

- Multidimensional arrays and matrices
- Mathematical functions useful for statistical analysis, such as mean, median, and standard deviation

#### Pandas

- Handles data in a way suited to analysis
- Offers data structures similar to data frames in R



# NumPy

In [17]:
import numpy as np

# Mean, median, and standard deviation
numbers = [1, 2, 3, 4, 5]

print('Mean: ', np.mean(numbers))
print('Median: ', np.median(numbers))
# np.std returns the population standard deviation by default (ddof=0)
print('Standard Deviation: ', np.std(numbers))


Mean:  3.0
Median:  3.0
Standard Deviation:  1.41421356237
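
The value 1.414... above is the population standard deviation, because `np.std` divides by N by default. A minimal sketch, assuming you want the sample standard deviation (dividing by N - 1) instead, passes `ddof=1`. The variable names `population_std` and `sample_std` are illustrative and not part of the original notebook.

```python
import numpy as np

numbers = [1, 2, 3, 4, 5]

# Default ddof=0: divide the sum of squared deviations by N
population_std = np.std(numbers)

# ddof=1: divide by N - 1 (sample standard deviation)
sample_std = np.std(numbers, ddof=1)

print('Population std:', population_std)
print('Sample std:', sample_std)
```

For this list the population variance is 2 and the sample variance is 2.5, so the two results are the square roots of those values.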


In [34]:
import numpy as np

# NumPy arrays
array = np.array([1, 4, 5, 8], float)
print(array)
array = np.array([[1, 2, 3], [4, 5, 6]], float)  # a 2D array (matrix)
print(array)
print("")

# Index, slice, and modify a NumPy array
array = np.array([1, 4, 5, 8], float)
print(array)
print(array[1])    # single element
print(array[:2])   # first two elements
array[1] = 5.0     # arrays are mutable
print(array[1])
print("")

two_D_array = np.array([[1, 2, 3], [4, 5, 6]], float)
print(two_D_array)
print(two_D_array[1, 1])   # element at row 1, column 1
print(two_D_array[1, :])   # all of row 1
print(two_D_array[:, 2])   # all of column 2
print("")

# Element-wise arithmetic operations
array_1 = np.array([1, 2, 3], float)
array_2 = np.array([5, 2, 6], float)
print(array_1 + array_2)
print(array_1 - array_2)
print(array_1 * array_2)
print("")

# Mean and dot product
array_1 = np.array([1, 2, 3], float)         # shape (3,)
array_2 = np.array([[6], [7], [8]], float)   # shape (3, 1)
print(np.mean(array_1))
print(np.mean(array_2))
print(np.dot(array_1, array_2))              # 1*6 + 2*7 + 3*8


[ 1.  4.  5.  8.]
[[ 1.  2.  3.]
 [ 4.  5.  6.]]

[ 1.  4.  5.  8.]
4.0
[ 1.  4.]
5.0

[[ 1.  2.  3.]
 [ 4.  5.  6.]]
5.0
[ 4.  5.  6.]
[ 3.  6.]

[ 6.  4.  9.]
[-4.  0. -3.]
[  5.   4.  18.]

2.0
7.0
[ 44.]
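
The mean can also be taken along one axis of a 2D array, and the shape of a dot product follows from the shapes of its inputs. The following is a small sketch that reuses the arrays from the cell above; the names `column_means`, `row_means`, `vector`, `column`, and `product` are illustrative only.

```python
import numpy as np

two_D_array = np.array([[1, 2, 3], [4, 5, 6]], float)

# axis=0 collapses the rows: one mean per column
column_means = np.mean(two_D_array, axis=0)
# axis=1 collapses the columns: one mean per row
row_means = np.mean(two_D_array, axis=1)

print(column_means)
print(row_means)

vector = np.array([1, 2, 3], float)         # shape (3,)
column = np.array([[6], [7], [8]], float)   # shape (3, 1)

# (3,) dot (3, 1) gives shape (1,), which is why the result prints as a one-element array
product = np.dot(vector, column)
print(product.shape)
```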


# Pandas

In [41]:
import pandas as pd

# A Series is a one-dimensional labeled array; the default index is 0, 1, 2, ...
series = pd.Series(['Dave', 'Cheng-Han', 'Udacity', 42, -1789710578])
print(series)
print("")

# Manually assign index labels to the items in the Series
series = pd.Series(['Dave', 'Cheng-Han', 359, 9001],
                   index=['Instructor', 'Curriculum Manager', 'Course Number', 'Power Level'])
print(series)
print("")

# Use index labels to select specific items from the Series
print(series['Instructor'])
print(series[['Instructor', 'Curriculum Manager', 'Course Number']])
print("")

# Use a boolean mask to select specific items from the Series
cuteness = pd.Series([1, 2, 3, 4, 5], index=['Cockroach', 'Fish', 'Mini Pig', 'Puppy', 'Kitten'])
print(cuteness > 3)             # boolean mask, one value per item
print("")
print(cuteness[cuteness > 3])   # only the items where the mask is True


0           Dave
1      Cheng-Han
2        Udacity
3             42
4    -1789710578
dtype: object

Instructor                 Dave
Curriculum Manager    Cheng-Han
Course Number               359
Power Level                9001
dtype: object

Dave
Instructor                 Dave
Curriculum Manager    Cheng-Han
Course Number               359
dtype: object

Cockroach    False
Fish         False
Mini Pig     False
Puppy         True
Kitten        True
dtype: bool

Puppy     4
Kitten    5
dtype: int64
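
Boolean masks can also be combined. The sketch below, which reuses the `cuteness` Series from the cell above, assumes you want to filter on more than one condition; the names `middle`, `extremes`, and `not_cute` are illustrative only. Each condition needs its own parentheses, because `&`, `|`, and `~` bind more tightly than comparison operators.

```python
import pandas as pd

cuteness = pd.Series([1, 2, 3, 4, 5],
                     index=['Cockroach', 'Fish', 'Mini Pig', 'Puppy', 'Kitten'])

# AND: both conditions must hold
middle = cuteness[(cuteness > 1) & (cuteness < 5)]
# OR: either condition may hold
extremes = cuteness[(cuteness == 1) | (cuteness == 5)]
# NOT: invert a mask
not_cute = cuteness[~(cuteness > 3)]

print(middle)
print(extremes)
print(not_cute)
```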


### Useful pandas functions for DataFrames

- dtypes: returns the data type of each column
- describe: shows basic statistics for the DataFrame's numerical columns
- head: displays the first five rows of the dataset by default
- tail: displays the last five rows of the dataset by default
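
These functions also accept optional arguments. The following is a hedged sketch, not part of the original notebook, that uses the same football data as the cell below: `head` and `tail` take a row count, and `describe(include='all')` also summarizes non-numerical columns. The names `first_two`, `last_three`, and `full_summary` are illustrative only.

```python
import pandas as pd

data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data)

first_two = football.head(2)     # first two rows instead of the default five
last_three = football.tail(3)    # last three rows instead of the default five

# include='all' adds rows such as 'unique', 'top', and 'freq' for text columns
full_summary = football.describe(include='all')

print(first_two)
print(last_three)
print(full_summary)
```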

In [45]:
import pandas as pd

# Build a DataFrame from a dictionary of equal-length lists (one list per column)
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data)
print(football)
print("")
print(football.dtypes)       # data type of each column
print("")
print(football.describe())   # summary statistics for the numerical columns
print("")
print(football.head())       # first five rows
print("")
print(football.tail())       # last five rows

   losses     team  wins  year
0       5    Bears    11  2010
1       8    Bears     8  2011
2       6    Bears    10  2012
3       1  Packers    15  2011
4       5  Packers    11  2012
5      10    Lions     6  2010
6       6    Lions    10  2011
7      12    Lions     4  2012

losses     int64
team      object
wins       int64
year       int64
dtype: object

          losses       wins         year
count   8.000000   8.000000     8.000000
mean    6.625000   9.375000  2011.125000
std     3.377975   3.377975     0.834523
min     1.000000   4.000000  2010.000000
25%     5.000000   7.500000  2010.750000
50%     6.000000  10.000000  2011.000000
75%     8.500000  11.000000  2012.000000
max    12.000000  15.000000  2012.000000

   losses     team  wins  year
0       5    Bears    11  2010
1       8    Bears     8  2011
2       6    Bears    10  2012
3       1  Packers    15  2011
4       5  Packers    11  2012

   losses     team  wins  year
3       1  Packers    15  2011
4       5  Packers