Commit

case: when only get content
j3nnn1 committed Aug 7, 2013
1 parent 1765807 commit 1666993
Showing 1 changed file with 4 additions and 3 deletions.
7 changes: 4 additions & 3 deletions textmining/remove_html_tags/remove_html.rb
@@ -2,7 +2,7 @@

require 'rubygems'
require 'sanitize'
-require 'CSV'
+require 'csv'

terms = Hash.new
terms = {'\xe1' => 'á', '\xe9' => 'é', '\xed' => 'í', '\xfa' => 'ú', '\xf3' => 'ó'}
@@ -14,15 +14,16 @@ def removeaccent(word, terms)
return word
end

-file_clean = File.open("lanacion.com.ar.csv.data.clean", "w")
+file_clean = File.open("infobae_finanza.csv.data.clean", "w")

-CSV.foreach("lanacion.com.ar.csv.data", encoding: 'UTF-8' ) do |row|
+CSV.foreach("infobae_finanza.csv", encoding: 'UTF-8' ) do |row|
#title
title = removeaccent(Sanitize.clean(row[1]).force_encoding('UTF-8'), terms)
#content
content = removeaccent(Sanitize.clean(row[0]).force_encoding('UTF-8'), terms)
#csv
csv_string = [title, content].to_csv
+#csv_string = [content].to_csv
file_clean.write(csv_string)
end
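
The commented-out `[content].to_csv` line suggests the "only get content" case is switched on by hand, by swapping which `csv_string` line is active. A minimal sketch of the same choice made with a parameter, under stated assumptions: `strip_tags` is a hypothetical regex stand-in for `Sanitize.clean` so the example runs without the sanitize gem (it is not equivalent to the real library), and `clean_row` and its `only_content` flag are illustrative names that do not appear in the commit. The column layout (`row[0]` is the content, `row[1]` is the title) follows the diff.

```ruby
require 'csv'

# Hypothetical stand-in for Sanitize.clean: a crude tag stripper so the
# sketch runs without the sanitize gem. Not equivalent to the real library.
def strip_tags(html)
  html.to_s.gsub(/<[^>]*>/, '')
end

# Builds one CSV line from a row laid out as in the diff:
# row[0] is the content, row[1] is the title.
# only_content is a hypothetical flag; the commit makes this choice by
# commenting one csv_string line in or out.
def clean_row(row, only_content: false)
  content = strip_tags(row[0])
  title = strip_tags(row[1])
  fields = only_content ? [content] : [title, content]
  fields.to_csv
end

row = ['<p>Body text</p>', '<h1>Headline</h1>']
puts clean_row(row)                     # title and content columns
puts clean_row(row, only_content: true) # content column only
```

With a flag like this, the loop body could stay unchanged between the two cases, and only the call site would decide which columns reach the `.clean` file.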
