Hanhan_Data_Science_Resources

helpful resources for (big) data science

DATA PREPROCESSING


FEATURE ENGINEERING

  • Feature Selection: https://www.analyticsvidhya.com/blog/2016/12/introduction-to-feature-selection-methods-with-an-example-or-how-to-select-the-right-variables/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+AnalyticsVidhya+%28Analytics+Vidhya%29
  • Why Feature Selection:
      • It enables the machine learning algorithm to train faster.
      • It reduces the complexity of a model and makes it easier to interpret.
      • It improves the accuracy of a model if the right subset is chosen.
      • It reduces overfitting.
  • Filter Methods, the selection of features is independent of any machine learning algorithms. Features are selected on the basis of their scores in various statistical tests for their correlation with the dependent variable. Example - Pearson’s Correlation, LDA, ANOVA, Chi-Square.
  • Wrapper Methods, train a model on a subset of features and, based on the inferences drawn from that model, decide to add or remove features from the subset. These methods are usually computationally very expensive. Example - Forward Selection, Backward Elimination, Recursive Feature Elimination.
  • Embedded Methods, implemented by algorithms that have their own built-in feature selection methods. Example - LASSO and RIDGE regression. Lasso regression performs L1 regularization which adds penalty equivalent to absolute value of the magnitude of coefficients. Ridge regression performs L2 regularization which adds penalty equivalent to square of the magnitude of coefficients. Other examples of embedded methods are Regularized trees, Memetic algorithm, Random multinomial logit.
  • Differences between Filter Methods and Wrapper Methods
  • Filter methods measure the relevance of features by their correlation with dependent variable while wrapper methods measure the usefulness of a subset of feature by actually training a model on it.
  • Filter methods are much faster than wrapper methods because they do not involve training models; wrapper methods, by contrast, are computationally very expensive.
  • Filter methods use statistical methods for evaluation of a subset of features while wrapper methods use cross validation.
  • Filter methods may fail to find the best subset of features on many occasions, whereas wrapper methods can always provide the best subset of features.
  • Using the subset of features from wrapper methods makes the model more prone to overfitting compared to using the subset of features from filter methods.
  • My strategy to use feature selection
  • Use filter methods in the data preprocessing step, before training, and choose the top features for model training.
  • Use wrapper methods during the model training step.
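A minimal base-R sketch of the filter-then-wrapper strategy above. The data frame and variable names here are made up for illustration; the filter step uses Pearson correlation and the wrapper step uses stepwise selection via step(), one simple wrapper method available in base R.

```r
# Made-up data: y depends strongly on x1, weakly on x2, not at all on x3.
set.seed(42)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
df$y <- 2 * df$x1 + 0.5 * df$x2 + rnorm(100, sd = 0.1)

# Filter step (preprocessing): rank features by absolute Pearson correlation
# with the target, keep the top k before training.
cors <- sapply(df[, c("x1", "x2", "x3")], function(col) abs(cor(col, df$y)))
top_features <- names(sort(cors, decreasing = TRUE))[1:2]

# Wrapper step (during training): stepwise selection with AIC on the
# pre-filtered features.
full_model <- lm(y ~ ., data = df[, c(top_features, "y")])
selected <- step(full_model, trace = 0)
```

The filter pass is cheap (no model training), so it can prune a wide feature set first; the more expensive wrapper pass then only searches over the survivors.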

DATA MINING BIBLE


-- correct R code is here: https://github.com/hanhanwu/Hanhan_Data_Science_Practice/blob/master/R_summarize_methods.R

-- NOTE1: When using R to connect to Oracle, Oracle SQL requires double quotes (not single quotes) around an alias. Meanwhile, in R's dbGetQuery() you have to wrap the whole query in double quotes, so escape each double quote inside the Oracle query with \. For example, dbGetQuery(con, "select col as \"Column1\" from my_table")
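A runnable sketch of the escaping in NOTE1. `con` and `my_table` are placeholders, and the dbGetQuery() call itself is commented out since it needs a live Oracle connection; the point is how the escaped quotes come out in the final query string.

```r
# The alias must reach Oracle wrapped in double quotes, which are escaped
# with \ inside the R string literal.
query <- "select col as \"Column1\" from my_table"
cat(query, "\n")   # select col as "Column1" from my_table
# result <- dbGetQuery(con, query)   # requires a live Oracle connection
```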

-- NOTE2: When using R to connect to SQL Server through RODBC, the limitation is that each handler points to one database, so you cannot join tables from multiple databases in one SQL query in R. But! You can use R's merge() function to do a Natural Join (a special case of inner join), Left Join, Right Join, and Full Outer Join. When I was running a large amount of data, R even did the joins faster than SQL Server!

-- NOTE3: Because of the RODBC limitation mentioned in NOTE2 above, the two existing pieces of data may already occupy a lot of memory before merging, and you can hit an out-of-memory error when you try to join them. When this happens, try options(java.parameters = "-Xmx3g"), which raises the Java heap available to Java-backed R packages (such as rJava-based drivers) to 3 GB; set it before those packages are loaded.
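The base-R merge() calls behind the join types in NOTE2, shown on two tiny made-up data frames standing in for tables pulled from two different databases:

```r
# Two toy "tables", as if fetched through two separate RODBC handlers.
orders    <- data.frame(id = c(1, 2, 3), amount = c(10, 20, 30))
customers <- data.frame(id = c(2, 3, 4), name = c("ann", "bob", "cat"))

inner <- merge(orders, customers, by = "id")                # natural/inner join
left  <- merge(orders, customers, by = "id", all.x = TRUE)  # left join
right <- merge(orders, customers, by = "id", all.y = TRUE)  # right join
full  <- merge(orders, customers, by = "id", all = TRUE)    # full outer join
```

Omitting `by` makes merge() join on all shared column names, which is the natural-join behavior mentioned above.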

  • A simple example of doing joins in R for SQL Server queries: https://github.com/hanhanwu/Hanhan_Data_Science_Resources/blob/master/R_SQLServer_multiDB_join.R

  • magrittr, a package that replaces nested R function calls with pipes: https://github.com/hanhanwu/magrittr
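How the pipe replaces nested calls: magrittr's %>% works as shown below, but this sketch uses base R's |> operator (R >= 4.1) so it runs without extra packages.

```r
# Nested: must be read inside-out.
nested <- round(mean(sqrt(c(1, 4, 9, 16))), 1)

# Piped: reads left to right, one step per stage.
piped <- c(1, 4, 9, 16) |> sqrt() |> mean() |> round(1)
```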

  • Challenges of Using R, and Compare with MapReduce

  • Paper Source: http://shivaram.org/publications/presto-hotcloud12.pdf

  • R is primarily used as a single-threaded, single-machine installation. R is not scalable, nor does it support incremental processing.

  • Scaling R to run on a cluster has its challenges. Unlike MapReduce, Spark and others, where only one record is addressed at a time, the ease of array-based programming is due to a global view of data. R programs maintain the structure of data by mapping data to arrays and manipulating them. For example, graphs are represented as adjacency matrices and outgoing edges of a vertex are obtained from the corresponding row.

  • Most real-world datasets are sparse. Without careful task assignment performance can suffer from load imbalance: certain tasks may process partitions containing many non-zero elements and end up slowing down the whole system.

  • In incremental processing, if a programmer writes y = f(x), then y is recomputed automatically whenever x changes. Supporting incremental updates is also challenging as array partitions which were previously sparse may become dense and vice-versa.
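The adjacency-matrix view described above can be shown on a small made-up directed graph: row i holds vertex i's outgoing edges, so a single row lookup returns them all.

```r
# 3-vertex directed graph as an adjacency matrix.
adj <- matrix(0, nrow = 3, ncol = 3,
              dimnames = list(c("a", "b", "c"), c("a", "b", "c")))
adj["a", "b"] <- 1   # edge a -> b
adj["a", "c"] <- 1   # edge a -> c
adj["b", "c"] <- 1   # edge b -> c

# Outgoing edges of vertex "a" come straight from its row.
out_of_a <- names(which(adj["a", ] == 1))
```

For a sparse real-world graph most entries in such a matrix are zero, which is exactly why the paper's point about sparse partitions and load imbalance matters.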


CLOUD PLATFORM MACHINE LEARNING


VISUALIZATION


DEEP LEARNING


Industry Data Analysis/Machine Learning Tools




  • Data Analysis Tricks and Tips

-- ENSEMBLE

-- DEAL WITH IMBALANCED DATASET

-- TIME SERIES

-- Segmentation

-- Use Clustering with Supervised Learning




-- In this article, when they talk about concepts such as the activation function, gradient descent, and the cost function, they give several methods for each, which is very helpful. I have also learned more about momentum, softmax, dropout, and techniques for dealing with class imbalance; it was my first time studying these in depth.

-- From the above article, here is a summary of the points I think are worth keeping in mind:


DATA SCIENCE INTERVIEW PREPARATION


LEARNING FROM OTHERS' EXPERIENCES


OTHER
